A comparison of traditional and state-of-the-art machine learning algorithms for type 2 diabetes prediction

Naciye Nur Arslan; Durmuş Özdemir

Research Article

Year 2024, Issue: 006, 1 - 11, 30.04.2024

Abstract

This research investigates the use of machine learning algorithms for early detection of diabetes. Due to its global prevalence and significant impact on health, timely identification of diabetes is crucial for effective treatment. In this study, machine learning models including Gradient Boosting Machines, Extreme Gradient Boosting, Light Gradient Boosting Machine, Categorical Boosting, k-Nearest Neighbors, Random Forest, Ridge Classifier, Logistic Regression, Gaussian Naive Bayes, and Decision Tree are evaluated for their capabilities in diabetes diagnosis. The primary aim is to train these models to distinguish between individuals with diabetes and those without, using relevant features from the dataset. Since the classes in the dataset are imbalanced, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to improve model performance. Categorical Boosting achieved the highest accuracy, 90.05%, making it the most successful model. Systematically evaluating these prominent machine learning models yields insight into their ability to recognize complex patterns indicative of diabetes. Healthcare professionals and researchers can use this understanding to develop more accurate and effective diagnostic tools, enabling early intervention and improving the overall quality of life for individuals affected by diabetes.
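
The class-balancing step named in the abstract can be illustrated with a minimal sketch of the SMOTE idea: a synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' implementation; the function name `smote`, its parameters, and the toy data below are assumptions made for illustration only, and a real study would more likely rely on an established library.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made-up values): 8 majority rows versus 3 minority rows.
majority = [[float(x), float(x % 3)] for x in range(8)]
minority = [[10.0, 10.0], [11.0, 12.0], [12.0, 11.0]]

# Oversample the minority class until both classes have the same size.
new_rows = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_rows
print(len(majority), len(balanced_minority))
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic rows do not leak into the data used to report accuracy.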

Keywords

diabetes, machine learning, ensemble learning, boosting, bagging, catboost, xgboost, lightgbm

References

[1] A. D. Deshpande, M. Harris-Hayes, and M. Schootman, “Epidemiology of diabetes and diabetes-related complications,” Phys. Ther., vol. 88, no. 11, pp. 1254-1264, Nov. 2008, doi: https://doi.org/10.2522/ptj.20080020.
[2] H. D. McIntyre, P. Catalano, C. Zhang, G. Desoye, E. R. Mathiesen, and P. Damm, “Gestational diabetes mellitus,” Natur. Rev. Dis. Prim., vol. 5, no. 1, pp. 47, Jul. 2019, doi: https://doi.org/10.1038/s41572-019-0098-8.
[3] İ. Akgül, Ö. Çağrı Yavuz, and U. Yavuz, “Deep Learning Based Models for Detection of Diabetic Retinopathy,” Tehnički glasnik, vol. 17, no. 4, pp. 581-587, Dec. 2023, doi: https://doi.org/10.31803/tg-20220905123827.
[4] N. Yuvaraj and K.R. SriPreethaa, “Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster,” Clust. Comp., vol. 22, no. 1, pp. 1-9, Jan. 2019, doi: https://doi.org/10.1007/s10586-017-1532-x.
[5] Z. Xie, O. Nikolayeva, J. Luo, and D. Li, “Peer reviewed: building risk prediction models for type 2 diabetes using machine learning techniques,” Prev. Chro. Dis., vol. 16, Sep. 2019, doi: https://doi.org/10.5888/pcd16.190109.
[6] S. Wei, X. Zhao, and C. Miao, “A comprehensive exploration to the machine learning techniques for diabetes identification,” in 2018 IEEE 4th World Forum on Int. of Things, 2018, pp. 291-295, doi: https://doi.org/10.1109/WF-IoT.2018.8355130.
[7] A. Yahyaoui, A. Jamil, J. Rasheed, and M. Yesiltepe, “A decision support system for diabetes prediction using machine learning and deep learning techniques,” in 2019 1st Inter. Inform. and Soft. Eng. Conf., 2019, pp. 1-4, doi: https://doi.org/10.1109/UBMYK48245.2019.8965556.
[8] Diabetes Health Indicators Dataset, https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset (accessed December 12, 2023).
[9] M. Crowther, W. Lim, and M. A. Crowther, “Systematic review and meta-analysis methodology,” The Jour. of the Amer. Soc. of Hemat., vol. 116, no. 17, pp. 3140-3146, Oct. 2010, doi: https://doi.org/10.1182/blood-2010-05-280883.
[10] S. Pandey, “Principles of correlation and regression analysis,” Jour. of the prac. of cardio. Sci., vol. 6, no. 1, pp. 7-11, Apr. 2020, doi: https://doi.org/10.4103/jpcs.jpcs_2_20.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Jour. of artif. Intel. Res., vol. 16, pp. 321-357, Jun. 2002, doi: https://doi.org/10.1613/jair.953.
[12] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma, “A survey on ensemble learning,” Front. of Comp. Sci., vol. 14, pp. 241-258, Apr. 2020, doi: https://doi.org/10.1007/s11704-019-8208-z.
[13] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, pp. 123-140, Aug. 1996, doi: https://doi.org/10.1007/BF00058655.
[14] R. E. Schapire, “A brief introduction to boosting,” IJCAI, vol. 99, no. 999, pp. 1401-1406, 1999.
[15] S. González, S. García, J. Del Ser, L. Rokach, and F. Herrera, “A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities,” Infor. Fus., vol. 64, pp. 205-237, Dec. 2020, doi: https://doi.org/10.1016/j.inffus.2020.07.007.
[16] D. R. Cox, “The regression analysis of binary sequences,” Jour. of the Roy. Stat. Soc. Ser. B: Stat. Meth., vol. 20, no. 2, pp. 215-232, Jul. 1958, doi: https://doi.org/10.1111/j.2517-6161.1958.tb00292.x.
[17] S. Jayachitra and A. Prasanth, “Multi-feature analysis for automated brain stroke classification using weighted Gaussian naïve Bayes classifier,” Jour. of Circ. Sys. and Comp., vol. 30, no. 10, pp. 2150178, 2021, doi: https://doi.org/10.1142/S0218126621501784.
[18] C. Kingsford and S. L. Salzberg, “What are decision trees?,” Nat. Biotech., vol. 26, no. 9, pp. 1011-1013, Sep. 2008, doi: https://doi.org/10.1038/nbt0908-1011.
[19] S. Uddin, I. Haque, H. Lu, M. A. Moni, and E. Gide, “Comparative performance analysis of K-nearest neighbor (KNN) algorithm and its different variants for disease prediction,” Sci. Rep., vol. 12, pp. 6256, Apr. 2022, doi: https://doi.org/10.1038/s41598-022-10358-x.
[20] S. Priyadarshinee and M. Panda, “Cardiac disease prediction using smote and machine learning classifiers,” Jour. of Pharma. Neg. Res., vol. 13, no. 8, pp. 856-862, Nov. 2022, doi: https://doi.org/10.47750/pnr.2022.13.S08.108.
[21] L. Breiman, “Random forests,” Mach. Learn., vol. 45, pp. 5-32, Oct. 2001, doi: https://doi.org/10.1023/A:1010933404324.
[22] A. Natekin and A. Knoll, “Gradient boosting machines, a tutorial,” Front. in Neurorob., vol. 7, pp. 21, Dec. 2013, doi: https://doi.org/10.3389/fnbot.2013.00021.
[23] T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, H. Cho, and T. Zhou, “Xgboost: extreme gradient boosting,” R package version 0.4-2, vol. 1, no. 4, pp. 1-4, 2015.
[24] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, and T. Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” Adv. in Neur. Info. Process. Sys., 30, 2017.
[25] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “CatBoost: unbiased boosting with categorical features,” Adv. in Neur. Info. Process. Sys., 31, 2018.

Details

Primary Language	English
Subjects	Computer Software, Software Engineering (Other)
Journal Section	Research Articles
Authors	Naciye Nur Arslan (ORCID 0000-0002-3208-7986); Durmuş Özdemir (ORCID 0000-0002-9543-4076)
Publication Date	April 30, 2024
Submission Date	September 25, 2023
Published in Issue	Year 2024 Issue: 006

Cite

APA	Arslan, N. N., & Özdemir, D. (2024). A comparison of traditional and state-of-the-art machine learning algorithms for type 2 diabetes prediction. Journal of Scientific Reports-C(006), 1-11.
AMA	Arslan NN, Özdemir D. A comparison of traditional and state-of-the-art machine learning algorithms for type 2 diabetes prediction. Journal of Scientific Reports-C. April 2024;(006):1-11.
Chicago	Arslan, Naciye Nur, and Durmuş Özdemir. “A Comparison of Traditional and State-of-the-Art Machine Learning Algorithms for Type 2 Diabetes Prediction”. Journal of Scientific Reports-C, no. 006 (April 2024): 1-11.
EndNote	Arslan NN, Özdemir D (April 1, 2024) A comparison of traditional and state-of-the-art machine learning algorithms for type 2 diabetes prediction. Journal of Scientific Reports-C 006 1–11.
IEEE	N. N. Arslan and D. Özdemir, “A comparison of traditional and state-of-the-art machine learning algorithms for type 2 diabetes prediction”, Journal of Scientific Reports-C, no. 006, pp. 1–11, April 2024.
ISNAD	Arslan, Naciye Nur - Özdemir, Durmuş. “A Comparison of Traditional and State-of-the-Art Machine Learning Algorithms for Type 2 Diabetes Prediction”. Journal of Scientific Reports-C 006 (April 2024), 1-11.
JAMA	Arslan NN, Özdemir D. A comparison of traditional and state-of-the-art machine learning algorithms for type 2 diabetes prediction. Journal of Scientific Reports-C. 2024;:1–11.
MLA	Arslan, Naciye Nur and Durmuş Özdemir. “A Comparison of Traditional and State-of-the-Art Machine Learning Algorithms for Type 2 Diabetes Prediction”. Journal of Scientific Reports-C, no. 006, 2024, pp. 1-11.
Vancouver	Arslan NN, Özdemir D. A comparison of traditional and state-of-the-art machine learning algorithms for type 2 diabetes prediction. Journal of Scientific Reports-C. 2024(006):1-11.