Research Article

Testing Decision Trees with Overfitting: A Comparative Performance Analysis

Year 2025, Volume: 18 Issue: 3, 253 - 268, 31.07.2025
https://doi.org/10.17671/gazibtd.1594957

Abstract

The overfitting problem, which hampers generalization capability, is addressed through hyperparameter optimization in Artificial Neural Networks (ANN), with the aid of the Vapnik-Chervonenkis (VC) coefficient in Support Vector Machines (SVM), and via pre-pruning and post-pruning in Decision Trees (DT). In this study, DT models built with the Gini, Entropy, and Logarithmic Loss Function (LLF) indices were applied in three separate phases (unpruned, pre-pruned, and post-pruned) to a real-world credit dataset, to a “balanced” version of that dataset, and, thirdly, to an agricultural dataset, and their results were compared with those of ANN, SVM, and Logistic Regression (LR) classifiers. ANN and DT hyperparameters were optimized with Bayesian optimization (Optuna), and SMOTE balancing was applied to all datasets. Across all three datasets and with all of its indices, DT delivered markedly better results than the other methods in all three pruning phases. The differences between the models' results were tested with the Friedman test and found to be statistically significant on all three datasets. The differences between the index results were not statistically significant on two of the datasets but were significant on the dataset with the largest number of observations (11069).
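A minimal sketch of the pipeline the abstract describes, for orientation only: the paper does not name its software stack, so scikit-learn, imbalanced-learn, and Optuna are assumed here, and synthetic data stands in for the credit and agricultural datasets. Note that criterion="log_loss" requires scikit-learn 1.1 or later.

    # Illustrative sketch, not the authors' code: three split criteria,
    # three pruning phases, SMOTE balancing, and Bayesian (TPE) tuning.
    import optuna
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic, imbalanced stand-in data (NOT the paper's datasets).
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)  # SMOTE balancing

    for criterion in ("gini", "entropy", "log_loss"):  # the three indices
        # Phase 1: unpruned tree, grown to full depth.
        unpruned = DecisionTreeClassifier(criterion=criterion, random_state=0)
        unpruned.fit(X_bal, y_bal)

        # Phase 2: pre-pruning; growth limits tuned with Optuna's Bayesian
        # (TPE) sampler, scored by cross-validated F1.
        def objective(trial):
            clf = DecisionTreeClassifier(
                criterion=criterion,
                max_depth=trial.suggest_int("max_depth", 2, 20),
                min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 50),
                random_state=0,
            )
            return cross_val_score(clf, X_bal, y_bal, cv=5, scoring="f1").mean()

        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=30)

        # Phase 3: post-pruning via minimal cost-complexity pruning.
        post_pruned = DecisionTreeClassifier(criterion=criterion, ccp_alpha=1e-3,
                                             random_state=0)
        post_pruned.fit(X_bal, y_bal)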
Overall, DT stands out as a strong and practical classifier, given the quality of the results it produced across all criteria (accuracy, standard deviation, precision, recall, F1, and AUC-ROC). However, the study also concluded that neither pre-pruning nor post-pruning contributed to DT's generalization capability.
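For the significance testing the abstract reports, a hedged sketch of how a Friedman test over per-fold scores can be run with scipy.stats; the numbers below are illustrative placeholders, not results from the paper.

    # Placeholder per-fold F1 scores for the four classifiers (NOT the
    # paper's results); the Friedman test compares the models' rankings.
    from scipy.stats import friedmanchisquare

    dt  = [0.91, 0.90, 0.92, 0.89, 0.93]  # Decision Tree
    ann = [0.85, 0.84, 0.86, 0.83, 0.85]  # Artificial Neural Network
    svm = [0.86, 0.85, 0.84, 0.86, 0.85]  # Support Vector Machine
    lr  = [0.82, 0.83, 0.81, 0.82, 0.84]  # Logistic Regression

    stat, p = friedmanchisquare(dt, ann, svm, lr)
    print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
    # p < 0.05 indicates the differences between models are significant,
    # which is what the abstract reports for all three datasets.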

References

  • S. Mussard, F. Seyte, M. Terraza, “Decomposition of Gini and the generalized entropy inequality measures”, Economics Bulletin, 4(7), 1-6, 2003.
  • O. Rahmati, M. Avand, P. Yariyan, J. P. Tiefenbacher, A. Azareh, D. T. Bui, “Assessment of Gini-, entropy- and ratio-based classification trees for groundwater potential modelling and prediction”, Geocarto International, 37(12), 3397-3415, 2022.
  • V. G. Costa, C. E. Pedreira, “Recent advances in decision trees: an updated survey”, Artificial Intelligence Review, 56, 4765-4800, 2023.
  • L. Zhang, X. Ma, Y. S. Ock, L. Qing, “Research on Regional Differences and Influencing Factors of Chinese Industrial Green Technology Innovation Efficiency Based on Dagum Gini Coefficient Decomposition”, Land, 11, 1-20, 2022.
  • D. Bertsimas, J. Dunn, “Optimal classification trees”, Machine Learning, 106, 1039-1082, 2017.
  • T. S. Biró, Z. Néda, “Gintropy: Gini Index Based Generalization of Entropy”, Entropy, 22, 1-13, 2020.
  • K. Ryu, D. J. Slottje, “Ten Ways to Specify a Gini Coefficient Using Entropy”, Annals of Financial Economics, 18(1), 1-19, 2023.
  • M. Pal, P. M. Mather, “An assessment of the effectiveness of decision tree methods for land cover classification”, Remote Sensing of Environment, 86(4), 554-565, 2003.
  • M. Xu, P. Watanachaturaporn, P. K. Varshney, M. K. Arora, “Decision tree regression for soft classification of remote sensing data”, Remote Sensing of Environment, 97(3), 322-336, 2005.
  • S. Chakrabartty, G. Cauwenberghs, “Gini Support Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression”, Journal of Machine Learning Research, 8, 813-839, 2007.
  • L. Jost, “Entropy and diversity”, Oikos, 113(2), 363-375, 2006.
  • T. Daniya, M. Geetha, K. Suresh, “Classification And Regression Trees With Gini Index”, Advances in Mathematics: Scientific Journal, 9(10), 8237–8247, 2020.
  • Y. Fogel, M. Feder, “Universal Batch Learning with Log-Loss”, 2018 IEEE International Symposium on Information Theory (ISIT), 6, 2018.
  • V. Vovk, “The Fundamental Nature of the Log Loss Function”, Fields of Logic and Computation II, book chapter, Cairo, Egypt, 307-318, 2015.
  • H. Patel, P. Prajapati, “Study and Analysis of Decision Tree Based Classification Algorithms”, International Journal of Computer Sciences and Engineering (IJCSE), 6(10), 74-78, 2018.
  • M. Bansal, A. Goyal, A. Choudhary, “A comparative analysis of K-Nearest Neighbor, Genetic, Support Vector Machine, Decision Tree, and Long Short Term Memory algorithms in machine learning”, Decision Analytics Journal, 3, 1-21, 2022.
  • Javatpoint, https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm, 27.11.2024.
  • T. Kavzoğlu, İ. Çölkesen, “Karar Ağaçları İle Uydu Görüntülerinin Sınıflandırılması: Kocaeli Örneği”, Harita Teknolojileri Elektronik Dergisi, 2(1), 36-45, 2010.
  • B. Gupta, A. Rawat, A. Jain, A. Arora, N. Dhami, “Analysis of Various Decision Tree Algorithms for Classification in Data Mining”, International Journal of Computer Applications, 163(8), 15-19, 2017.
  • A. Aggarwal, S. Kasiviswanathan, Z. Xu, O. Feyisetan, N. Teissier, “Label Inference Attacks from Log-loss Scores”, Proceedings of the 38th International Conference on Machine Learning, PMLR, 139, 120-129, 2021.
  • S.B. Kotsiantis, “Decision trees: a recent overview”. Artificial Intelligence Review, 39, 261-283, 2011.
  • Y. Y. Song, Y. Lu, “Decision tree methods: applications for classification and prediction”, Shanghai Archives of Psychiatry, 27(2), 130-135, 2015.
  • H. Sharma, S. Kumar, “A Survey on Decision Tree Algorithms of Classification in Data Mining”, International Journal of Science and Research (IJSR), 5(4), 2094-2097, 2016.
  • M. A. Friedl, C. E. Brodley, “Decision Tree Classification of Land Cover from Remotely Sensed Data”, Remote Sensing of Environment, 61(3), 399-409, 1997.
  • T. Dietterich, “Overfitting and Undercomputing in Machine Learning”, ACM Computing Surveys, 27(3), 326-327, 1995.
  • D. Wang, Z. Peng, L. Xi, “The sum of weighted normalized square envelope: A unified framework for kurtosis, negative entropy, Gini index and smoothness index for machine health monitoring”, Mechanical Systems and Signal Processing, 140, 1-10, 2020.
  • Dembla, https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a, 26.11.2024.
  • D. Corfield, B. Schölkopf, V. Vapnik, “Popper, Falsification and the VC Dimension”, Max Planck Institute for Biological Cybernetics, Technical Report 145, 1-4, 2005.
There are 28 references in total.

Details

Primary Language Turkish
Subjects Decision Support and Group Support Systems, Data Structures and Algorithms, Neural Networks, Data Mining and Knowledge Discovery
Journal Section Research Article
Authors

Gökhan Korkmaz (ORCID: 0000-0002-1702-2965)

Publication Date July 31, 2025
Submission Date December 9, 2024
Acceptance Date June 25, 2025
Published in Issue Year 2025 Volume: 18 Issue: 3

Cite

APA Korkmaz, G. (2025). Karar Ağaçlarının Aşırı Uyumla Sınavı: Karşılaştırmalı Bir Performans Analizi. Bilişim Teknolojileri Dergisi, 18(3), 253-268. https://doi.org/10.17671/gazibtd.1594957