estuscience - se

Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering

2667-4211

Eskisehir Technical University

10.18038/estubtda.1891746

Artificial Intelligence (Other)


MACHINE LEARNING AND VALIDATION STRATEGIES IN PANEL DATA-BASED GREENHOUSE GAS EMISSION MODELING

https://orcid.org/0000-0002-4280-0394

Demircioğlu Diren

Deniz

SAKARYA ÜNİVERSİTESİ

03 27 2026

Volume 27, Issue 1, pp. 204-219. Dates: 02 17 2026; 03 19 2026.

2000


In this study, sector-based methane emissions of European countries were modeled using a Random Forest–based machine learning approach applied to a panel dataset covering the period 2014–2023 with country–sector–year dimensions. The primary objective of the study is not to maximize predictive accuracy, but to evaluate how different validation strategies affect model performance and generalization behavior. Accordingly, three validation strategies—random training–test split, temporal (time-based) validation, and country-based group validation—were comparatively analyzed. The dataset, obtained from Eurostat, comprises 29 countries, 5 sectors, and 1,449 observations. Model performance was evaluated using root mean square error and the coefficient of determination. Under random splitting, the model achieved very low errors (mean RMSE = 0.0126 ± 0.0025; mean R² = 0.9993 ± 0.0003), although these results may be optimistic due to information leakage. Temporal validation yielded stable near-future performance (RMSE = 0.0225, R² = 0.9975). In contrast, country-based group validation resulted in a substantial performance decline (average RMSE = 0.3132 ± 0.4061), indicating strong cross-country heterogeneity. Overall, the findings demonstrate that, in panel data settings, the choice of validation strategy is as critical as the machine learning algorithm for realistic generalization assessment.
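The three validation strategies compared in the abstract differ only in how rows of the country–sector–year panel are assigned to the training and test sets. The following is a minimal sketch of that assignment, not the study's actual pipeline. The country and sector labels are placeholders, the synthetic index is a full 29 × 5 × 10 grid (1,450 rows, whereas the study reports 1,449 observations), and the 20% test fraction, the 2022 cutoff year, and the six held-out countries are illustrative assumptions that do not come from the source.

```python
import random
from itertools import product

# Hypothetical panel index: placeholder labels, not the Eurostat codes.
countries = [f"C{i:02d}" for i in range(1, 30)]   # 29 countries
sectors = [f"S{i}" for i in range(1, 6)]          # 5 sectors
years = list(range(2014, 2024))                   # 2014-2023
panel = list(product(countries, sectors, years))  # rows: (country, sector, year)


def random_split(rows, test_frac=0.2, seed=0):
    """Shuffle rows and hold out a fraction, ignoring the panel structure."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(rows, first_test_year):
    """Train on years before the cutoff; test on the cutoff year onward."""
    train = [r for r in rows if r[2] < first_test_year]
    test = [r for r in rows if r[2] >= first_test_year]
    return train, test


def group_split(rows, test_countries):
    """Hold out every row that belongs to the chosen countries."""
    held_out = set(test_countries)
    train = [r for r in rows if r[0] not in held_out]
    test = [r for r in rows if r[0] in held_out]
    return train, test


def shared_series(train, test):
    """Count (country, sector) series present on both sides of a split."""
    return len({r[:2] for r in train} & {r[:2] for r in test})


if __name__ == "__main__":
    splits = {
        "random": random_split(panel),
        "temporal": temporal_split(panel, first_test_year=2022),
        "country-group": group_split(panel, countries[:6]),
    }
    for name, (train, test) in splits.items():
        print(name, len(train), len(test), shared_series(train, test))
```

Under the random split, most country–sector series contribute rows to both sets, so the model is tested on series it has already seen in neighbouring years, which is the information leakage the abstract warns about. The temporal split also shares series but keeps every test year strictly after every training year. The group split shares no series at all, so it measures generalization to unseen countries.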

Panel data, Machine learning, Validation strategies, Greenhouse gas emissions

No external funding was received for this study.
