TY  - JOUR
T1  - Comparison of Some Performance Metrics Used in Multiple Classification Problems
TT  - Çoklu Sınıflandırma Problemlerinde Kullanılan Bazı Performans Ölçütlerinin Karşılaştırılması
AU  - Ağlarcı, Ali Vasfi
AU  - Bal, Cengiz
PY  - 2025
DA  - June
Y2  - 2025
DO  - 10.21180/iibfdkastamonu.1561910
JF  - Kastamonu Üniversitesi İktisadi ve İdari Bilimler Fakültesi Dergisi
PB  - Kastamonu Üniversitesi
WT  - DergiPark
SN  - 2147-6012
SP  - 22
EP  - 39
VL  - 27
IS  - 1
LA  - en
AB  - The purpose of this research is to compare the performance metrics used in multi-class classification problems in machine learning. To this end, a simulation study was carried out under different scenarios using 4 different classification methods, and the resulting performance metrics were compared. The data used for classification were generated under different scenarios that take into account the effect of 4 factors. In total, 90 scenarios were created by combining 3 different numbers of response-variable categories, 5 different sample sizes, 3 different correlation structures, and balanced versus unbalanced distributions of the response variable (3 × 5 × 3 × 2 = 90). Accuracy, Kappa, and Cramér's V, which are used in multi-class classification problems, served as the performance measures. Changes in the performance metrics across the scenarios are summarized in tables and compared. The simulation results show that Kappa is a more accurate performance metric than the other two in multi-class classification problems and that it gives more reliable information about classification success.
KW  - Classification success
KW  - classification performance
KW  - machine learning
KW  - simulation
KW  - performance metrics
N2  - The aim of this research is to compare the performance metrics used in multi-class classification problems in machine learning.
To this end, a simulation study was conducted under different scenarios using 4 different classification methods, and the resulting performance metrics were compared accordingly. For this comparison, the data to be classified were generated under different scenarios that account for the effect of 4 factors. A total of 90 scenarios were created from 3 different numbers of response-variable categories, 5 different sample sizes, 3 different correlation structures, and balanced and unbalanced distributions of the response variable. The Accuracy, Kappa, and Cramér's V metrics used in multi-class classification problems served as the performance measures. Changes in the performance metrics across these scenarios are summarized in tables and compared. The comparisons made through the simulation study show that Kappa is a more accurate performance metric than the other two in multi-class classification problems and that it gives more reliable information about classification success.
UR  - https://doi.org/10.21180/iibfdkastamonu.1561910
L1  - https://dergipark.org.tr/tr/download/article-file/4265889
ER  - 