Comparison of Some Performance Metrics Used in Multiple Classification Problems

Ali Vasfi Ağlarcı; Cengiz Bal

doi:10.21180/iibfdkastamonu.1561910

Research Article

Turkish title: Çoklu Sınıflandırma Problemlerinde Kullanılan Bazı Performans Ölçütlerinin Karşılaştırılması

Year 2025, Volume: 27, Issue: 1, pp. 22-39, 30.06.2025

Abstract

This study compares performance metrics used in multi-class classification problems in machine learning. To this end, a simulation study was carried out under different scenarios using four classification methods, and the resulting performance metrics were compared. The data to be classified were generated under scenarios that account for four factors: three different numbers of response-variable categories, five sample sizes, three correlation structures, and balanced versus imbalanced distributions of the response variable, giving 90 scenarios in total (3 × 5 × 3 × 2). Accuracy, Kappa, and Cramér's V were used as the performance measures. Changes in these metrics across the scenarios are summarized in tables and compared. The simulation results indicate that Kappa is a more accurate performance metric than the other two in multi-class classification problems and that it gives more reliable information about a method's classification success.

Keywords

Classification success, classification performance, machine learning, simulation, performance metrics
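The scenario count reported in the abstract is consistent with a full factorial design over the four factors. The sketch below enumerates such a grid in Python. The abstract states only how many levels each factor has, so every concrete level value here (the category counts, sample sizes, and correlation labels) is a hypothetical placeholder, not a value taken from the study.

```python
from itertools import product

# Hypothetical placeholder levels: the abstract gives only the NUMBER of
# levels per factor (3, 5, 3, 2), not the values themselves.
n_categories = [3, 4, 5]
sample_sizes = [50, 100, 250, 500, 1000]
correlation_structures = ["low", "medium", "high"]
class_balance = ["balanced", "imbalanced"]

# Full factorial design: one scenario per combination of factor levels.
scenarios = [
    {
        "n_categories": k,
        "sample_size": n,
        "correlation": rho,
        "balance": balance,
    }
    for k, n, rho, balance in product(
        n_categories, sample_sizes, correlation_structures, class_balance
    )
]

print(len(scenarios))  # 3 * 5 * 3 * 2 = 90
```

Each of the four classification methods would then be fitted in every scenario, so the four methods multiply the number of model fits but do not add scenarios.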
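To make the three metrics concrete, the following minimal Python sketch computes Accuracy, Cohen's Kappa, and Cramér's V from a multi-class confusion matrix using their textbook definitions. It is an illustration only, not the authors' simulation code. The function names, the toy class counts, and the choice to return 0 for Cramér's V when the table has a single non-empty row or column are all assumptions made for this example.

```python
import math


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


def _margins(cm):
    k = len(cm)
    rows = [sum(cm[i]) for i in range(k)]
    cols = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    return rows, cols, sum(rows)


def accuracy(cm):
    _, _, n = _margins(cm)
    return sum(cm[i][i] for i in range(len(cm))) / n


def cohen_kappa(cm):
    """Kappa = (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    rows, cols, n = _margins(cm)
    p_o = sum(cm[i][i] for i in range(len(cm))) / n
    p_e = sum(r * c for r, c in zip(rows, cols)) / (n * n)
    if p_e == 1.0:  # assumption: treat the degenerate one-class case as 0
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def cramers_v(cm):
    """V = sqrt(chi2 / (n * (min(r, c) - 1))) over non-empty rows/columns."""
    rows, cols, n = _margins(cm)
    chi2 = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / n
            if expected > 0:
                chi2 += (cm[i][j] - expected) ** 2 / expected
    m = min(sum(r > 0 for r in rows), sum(c > 0 for c in cols)) - 1
    if m == 0:  # assumption: no association measurable, report 0
        return 0.0
    return math.sqrt(chi2 / (n * m))


# Toy imbalanced 3-class data (hypothetical): 90 / 5 / 5 observations.
labels = ["A", "B", "C"]
y_true = ["A"] * 90 + ["B"] * 5 + ["C"] * 5

# A classifier that always predicts the majority class.
cm_majority = confusion_matrix(y_true, ["A"] * 100, labels)
# A classifier that predicts every observation correctly.
cm_perfect = confusion_matrix(y_true, y_true, labels)

for name, cm in [("majority", cm_majority), ("perfect", cm_perfect)]:
    print(name, accuracy(cm), cohen_kappa(cm), cramers_v(cm))
```

On this toy data the majority-class predictor reaches an accuracy of 0.9 while its Kappa is 0, because all of its agreement with the true labels is what chance alone would produce given the class frequencies. This is the kind of behaviour under class imbalance that makes chance-corrected measures such as Kappa informative alongside accuracy.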

The article lists 31 references in total.

Details

Primary Language	English
Subjects	Econometric and Statistical Methods
Section	Research Article
Authors	Ali Vasfi Ağlarcı (ORCID 0000-0002-9010-4537), Cengiz Bal (ORCID 0000-0002-1553-2902)
Publication Date	30 June 2025
Submission Date	5 October 2024
Acceptance Date	14 March 2025
Published in Issue	Year 2025, Volume: 27, Issue: 1

Cite

APA	Ağlarcı, A. V., & Bal, C. (2025). Comparison of Some Performance Metrics Used in Multiple Classification Problems. Kastamonu Üniversitesi İktisadi ve İdari Bilimler Fakültesi Dergisi, 27(1), 22-39. https://doi.org/10.21180/iibfdkastamonu.1561910
