Çoklu Doğrusal Bağlantılı Nadir Olayların Modellenmesinde Lasso ve Ridge Regresyon ile Boosting Algoritmalarının Performans Karşılaştırması

Olcay Alpay

doi:10.33484/sinopfbd.1434260

Research Article

Performance Comparison of Lasso and Ridge Regression and Boosting Algorithms for Modeling Rare Events with Multicollinearity

Year 2024, Volume: 9 Issue: 1, 154 - 166, 29.06.2024

Olcay Alpay

https://doi.org/10.33484/sinopfbd.1434260

Cited By: 1

Abstract

This study examines two problems that arise when machine learning techniques are used to model binary events: rarity and multicollinearity. Multicollinearity (MC) is the presence of one or more strong linear dependencies among the independent variables. When it is present in the data, it leads to undesirable consequences such as inflated variances of the regression coefficients. The study presents a simulation that compares the performance of algorithms in modelling rare events with multicollinearity. Regularization and scaling techniques such as Lasso and Ridge Regression are used, as well as Boosting algorithms such as GradientBoost, XGBoost, LightGBM, and AdaBoost. The effect of resampling methods applied to reduce data imbalance is also investigated using performance metrics such as Mean Squared Error (MSE), R^2, Precision (Prec), Recall (Rec), and Area Under the Curve (AUC), along with Receiver Operating Characteristic (ROC) curves. The results help to determine which of the Lasso, Ridge, and Boosting approaches is appropriate for modelling rare events with multicollinearity and provide insight into their relative performance.
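The variance inflation mentioned in the abstract can be illustrated with a small sketch. It is not the simulation design used in the study: the sample size, correlation levels, intercept, slopes, and function names below are all assumptions chosen for illustration. It relies on the standard linear-regression diagnostic, the variance inflation factor, which for exactly two predictors with correlation r reduces to 1 / (1 - r^2). For a logistic model this is only an approximate guide to how much the coefficient variances grow.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n, rho, intercept, seed=0):
    """Draw two predictors with correlation close to rho and a binary outcome.

    A strongly negative intercept makes the event rare. All values are
    illustrative assumptions, not the settings used in the paper.
    """
    rng = random.Random(seed)
    x1, x2, y = [], [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        # b is built from a plus independent noise so that corr(a, b) ~ rho
        b = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # logistic link with hypothetical slopes of 0.5 on each predictor
        p = 1.0 / (1.0 + math.exp(-(intercept + 0.5 * a + 0.5 * b)))
        x1.append(a)
        x2.append(b)
        y.append(1 if rng.random() < p else 0)
    return x1, x2, y


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in the two-predictor case: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


for rho in (0.0, 0.9, 0.99):
    x1, x2, y = simulate(n=5000, rho=rho, intercept=-4.0, seed=1)
    print(f"rho={rho:4.2f}  VIF={vif_two_predictors(x1, x2):7.2f}  "
          f"event rate={sum(y) / len(y):.3f}")
```

As the correlation between the two predictors approaches 1, the factor grows without bound, which is the setting in which penalized estimators such as Lasso and Ridge are usually considered. The low event rate produced by the negative intercept corresponds to the rarity problem that the resampling methods are meant to address.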

Keywords

Lasso regression , Ridge regression , Boosting algorithms , performance metrics , resampling techniques


There are 29 citations in total.

Details

Primary Language	Turkish
Subjects	Statistics (Other)
Journal Section	Research Article
Authors	Olcay Alpay 0000-0003-1446-0801
Submission Date	February 9, 2024
Acceptance Date	May 21, 2024
Publication Date	June 29, 2024
Published in Issue	Year 2024 Volume: 9 Issue: 1

Cite

APA	Alpay, O. (2024). Çoklu Doğrusal Bağlantılı Nadir Olayların Modellenmesinde Lasso ve Ridge Regresyon ile Boosting Algoritmalarının Performans Karşılaştırması. Sinop Üniversitesi Fen Bilimleri Dergisi, 9(1), 154-166. https://doi.org/10.33484/sinopfbd.1434260

Cited By

A Universal Model for Debt Transparency Based on The Forecast of The Ratio of Debt to GDP

Mehmet Akif Ersoy Üniversitesi İktisadi ve İdari Bilimler Fakültesi Dergisi

https://doi.org/10.30798/makuiibf.1490441

Articles published in Sinopjns are licensed under CC BY-NC 4.0.