Makine Öğrenimi Teknikleri ile Kredi Risk Tahmininde Yeniden Örnekleme Yöntemlerinin Karşılaştırılması

Gülçin Kendirkıran; Seyyide Doğan

Research Article

Comparison of Machine Learning Techniques and Different Resampling Methods in Credit Risk Estimation

Year 2024, Volume: 1 Issue: 2, 48-60, 30.12.2024

Abstract

Class imbalance is one of the most important issues affecting machine learning performance. When it occurs, as it frequently does in real-world problems, the minority class is neglected during learning and the resulting predictions are biased towards the majority class. This study evaluates four resampling methods for coping with class imbalance on two popular credit scoring datasets (Australian and German) from the UCI repository. In this problem, where the creditworthiness of bank customers is estimated, customers fall into two imbalanced classes, good and bad. Several machine learning techniques, namely Support Vector Machines (SVM), Random Forests (RF), Extreme Gradient Boosting (XGBoost), and CatBoost, were used to predict risky customers, and these algorithms were combined with the resampling approaches Random Oversampling (ROS), Random Undersampling (RUS), SMOTE, and Tomek Links to address the class imbalance. According to the experimental results, resampling methods are effective in improving classifier performance, and the CatBoost classifier combined with SMOTE and the RF classifier combined with ROS were observed to produce better results.
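
As a rough illustration of three of the resampling ideas named in the abstract (ROS, RUS, and SMOTE-style interpolation), the following is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy data, and the parameters `k` and `seed` are assumptions made for illustration only, and Tomek Links are not shown.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, minority, k=3, seed=0):
    """SMOTE-style: add synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours.
    Assumes numeric features and at least two minority rows."""
    rng = random.Random(seed)
    rows = [x for x, lab in zip(X, y) if lab == minority]
    n_needed = max(Counter(y).values()) - len(rows)
    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        i = rng.randrange(len(rows))
        base = rows[i]
        others = [r for j, r in enumerate(rows) if j != i]
        # Squared Euclidean distance is enough for ranking neighbours.
        neighbours = sorted(
            others, key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r))
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
        y_out.append(minority)
    return X_out, y_out


# Hypothetical toy data (not from the Australian or German datasets):
# six "good" customers and two "bad" customers, two numeric features each.
X = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.1), (1.8, 1.9), (2.1, 2.0),
     (5.0, 6.0), (5.5, 6.5)]
y = ["good"] * 6 + ["bad"] * 2

for name, (X_new, y_new) in {
    "ROS": random_oversample(X, y),
    "RUS": random_undersample(X, y),
    "SMOTE-like": smote_like(X, y, minority="bad"),
}.items():
    print(name, dict(Counter(y_new)))
```

In practice, resampling of this kind is normally applied to the training split only, so that the test data keeps its original class distribution.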

Keywords

Machine Learning, Imbalanced Data, Credit Scoring

Details

Primary Language	Turkish
Subjects	Strategy, Management and Organisational Behaviour (Other)
Section	Research Article
Authors	Gülçin Kendirkıran (ORCID: 0000-0003-3146-0192), Seyyide Doğan
Publication Date	December 30, 2024
Submission Date	November 13, 2024
Acceptance Date	December 20, 2024
Published in Issue	Year 2024 Volume: 1 Issue: 2

Cite

APA	Kendirkıran, G., & Doğan, S. (2024). Makine Öğrenimi Teknikleri ile Kredi Risk Tahmininde Yeniden Örnekleme Yöntemlerinin Karşılaştırılması. Söke İşletme Fakültesi Dergisi, 1(2), 48-60.
