Research Article

MACHINE LEARNING AND VALIDATION STRATEGIES IN PANEL DATA-BASED GREENHOUSE GAS EMISSION MODELING

Year 2026, Volume: 27, Issue: 1, 204-219, 27.03.2026
https://doi.org/10.18038/estubtda.1891746
https://izlik.org/JA87BX42RJ

Abstract

In this study, sector-based methane emissions of European countries were modeled using a Random Forest–based machine learning approach applied to a panel dataset covering the period 2014–2023 with country–sector–year dimensions. The primary objective of the study is not to maximize predictive accuracy, but to evaluate how different validation strategies affect model performance and generalization behavior. Accordingly, three validation strategies—random training–test split, temporal (time-based) validation, and country-based group validation—were comparatively analyzed. The dataset, obtained from Eurostat, comprises 29 countries, 5 sectors, and 1,449 observations. Model performance was evaluated using root mean square error and the coefficient of determination. Under random splitting, the model achieved very low errors (mean RMSE = 0.0126 ± 0.0025; mean R² = 0.9993 ± 0.0003), although these results may be optimistic due to information leakage. Temporal validation yielded stable near-future performance (RMSE = 0.0225, R² = 0.9975). In contrast, country-based group validation resulted in a substantial performance decline (average RMSE = 0.3132 ± 0.4061), indicating strong cross-country heterogeneity. Overall, the findings demonstrate that, in panel data settings, the choice of validation strategy is as critical as the machine learning algorithm for realistic generalization assessment.
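The three validation schemes compared above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the panel is synthetic (a full 29 × 5 × 10 grid of 1,450 rows, one more than the 1,449 observations reported), the names `make_panel`, `random_split`, `temporal_split`, `group_folds`, and `fit_predict` are hypothetical, the 80/20 ratio, the 2022 cut-off year, and the five folds are arbitrary choices, and a simple series-mean predictor stands in for the Random Forest so that the block needs only the standard library.

```python
import math
import random
from itertools import product

# Hypothetical panel dimensions: 29 countries x 5 sectors x 10 years (2014-2023).
COUNTRIES = [f"C{i:02d}" for i in range(29)]
SECTORS = [f"S{j}" for j in range(5)]
YEARS = list(range(2014, 2024))


def make_panel(seed=0):
    """Build synthetic rows (country, sector, year, value); not Eurostat data."""
    rng = random.Random(seed)
    level = {(c, s): rng.uniform(0.0, 5.0) for c, s in product(COUNTRIES, SECTORS)}
    return [
        (c, s, y, level[(c, s)] + 0.01 * (y - 2014) + rng.gauss(0.0, 0.02))
        for c, s, y in product(COUNTRIES, SECTORS, YEARS)
    ]


def random_split(rows, test_frac=0.2, seed=0):
    """Shuffle all rows, so one country-sector series lands on both sides."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) - round(len(idx) * test_frac)
    return idx[:cut], idx[cut:]


def temporal_split(rows, first_test_year=2022):
    """Train on earlier years and test on the final years only."""
    train = [i for i, r in enumerate(rows) if r[2] < first_test_year]
    test = [i for i, r in enumerate(rows) if r[2] >= first_test_year]
    return train, test


def group_folds(rows, k=5):
    """Yield (train, test) index lists in which whole countries are held out."""
    for fold in range(k):
        held_out = set(COUNTRIES[fold::k])
        train = [i for i, r in enumerate(rows) if r[0] not in held_out]
        test = [i for i, r in enumerate(rows) if r[0] in held_out]
        yield train, test


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_predict(rows, train, test):
    """Stand-in model (NOT a Random Forest): predict the training mean of the
    same country-sector series, falling back to the sector mean when that
    series never appears in the training rows."""
    by_series, by_sector = {}, {}
    for i in train:
        c, s, _, v = rows[i]
        by_series.setdefault((c, s), []).append(v)
        by_sector.setdefault(s, []).append(v)
    preds = []
    for i in test:
        c, s, _, _ = rows[i]
        vals = by_series.get((c, s)) or by_sector[s]
        preds.append(sum(vals) / len(vals))
    return preds


def evaluate(rows, train, test):
    """Return (RMSE, R^2) of the stand-in model on the test rows."""
    y_true = [rows[i][3] for i in test]
    y_pred = fit_predict(rows, train, test)
    return rmse(y_true, y_pred), r2(y_true, y_pred)


rows = make_panel()
random_rmse, random_r2 = evaluate(rows, *random_split(rows))
temporal_rmse, temporal_r2 = evaluate(rows, *temporal_split(rows))
group_scores = [evaluate(rows, tr, te) for tr, te in group_folds(rows)]
group_rmse = sum(s[0] for s in group_scores) / len(group_scores)

print(f"random split   RMSE={random_rmse:.4f}  R2={random_r2:.4f}")
print(f"temporal split RMSE={temporal_rmse:.4f}  R2={temporal_r2:.4f}")
print(f"country folds  mean RMSE={group_rmse:.4f}")
```

Holding out whole countries removes the series-level information that a random split lets the model reuse, which is the leakage mechanism the abstract points to. The magnitudes printed here reflect the synthetic data and the stand-in predictor only; they are not comparable to the reported results.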

Ethical Statement

No human-subjects data were collected; therefore, IRB/ethics committee approval was not required.

Supporting Institution

No external funding was received for this study.

There are 28 citations in total.

Details

Primary Language: English
Subjects: Artificial Intelligence (Other)
Journal Section: Research Article
Authors: Deniz Demircioğlu Diren (ORCID: 0000-0002-4280-0394)
Submission Date: February 17, 2026
Acceptance Date: March 19, 2026
Publication Date: March 27, 2026
DOI: https://doi.org/10.18038/estubtda.1891746
IZ: https://izlik.org/JA87BX42RJ
Published in Issue: Year 2026, Volume: 27, Issue: 1

Cite

AMA: 1. Demircioğlu Diren D. MACHINE LEARNING AND VALIDATION STRATEGIES IN PANEL DATA-BASED GREENHOUSE GAS EMISSION MODELING. Estuscience - Se. 2026;27(1):204-219. doi:10.18038/estubtda.1891746