Research Article

Comparative Performance Analysis of Selected Machine Learning Algorithms and the Stacking Ensemble Method for Prediction of the Type II Diabetes Disease

Year 2024, Volume: 11 Issue: 3, 622 - 646, 30.09.2024
https://doi.org/10.54287/gujsa.1531997

Abstract

Diabetes is a prevalent non-communicable disease affecting many people globally. Common risk factors include obesity, age, lack of exercise, lifestyle, genetic factors, high blood pressure, and poor diet. Early identification of the condition can help prevent subsequent complications, including heart attacks, lower limb amputations, nerve damage, and blindness. Over the years, data mining and machine learning have become popular and successful methods of identifying numerous diseases, including diabetes, from clinical data. This study focuses on the principles and processes of the Naïve Bayes, Support Vector Machines, Logistic Regression, Decision Tree, and Random Forest algorithms for diabetes prediction, using the built-in Scikit-learn libraries for the experiments. Furthermore, we combine all five machine learning models into a single stacked ensemble model. Data preprocessing techniques such as scaling, missing data removal, dimensionality reduction, and balancing of the target class were applied to the Jos Urban Diabetes dataset used for this study. A comparison of the algorithms' performance across various evaluation metrics demonstrates that the Support Vector Machines algorithm outperforms all others in terms of Accuracy, Precision, Sensitivity, and Matthews Correlation Coefficient, with scores of 96.11%, 91.61%, 85.67%, and 82.59% respectively under 10-fold cross-validation. The Stacked Ensemble Method model achieved the best Area Under the Receiver Operating Characteristic Curve score, 98.47%, under 10-fold cross-validation.
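
As a reminder of how the threshold-based metrics named in the abstract are defined, the following minimal Python sketch computes Accuracy, Precision, Sensitivity, and the Matthews Correlation Coefficient from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from the study or from the Jos Urban Diabetes dataset. The Area Under the ROC Curve is not included because it depends on ranked prediction scores rather than on a single confusion matrix.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute four standard metrics from binary confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives. Undefined ratios (zero
    denominators) are reported as 0.0 by convention in this sketch.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not results from the paper).
scores = binary_metrics(tp=40, fp=5, tn=50, fn=5)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The Matthews Correlation Coefficient ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it informative when the target classes are imbalanced; the abstract reports it on a percentage scale.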

References

  • Armstrong, A. (2022, March 1). Python in Healthcare: AI Applications in Hospitals. https://www.datacamp.com/blog/python-in-healthcare-ai-applications-in-hospitals?utm_medium=email&utm_source=customerio&utm_id=7430059&utm_campaign=dc_insights&utm_term=v2blog
  • Bhatia, P. (2019). Data mining and data warehousing: Principles and practical techniques. Cambridge University Press.
  • Birjais, R., Mourya, A. K., Chauhan, R., & Kaur, H. (2019). Prediction and diagnosis of future diabetes risk: A machine learning approach. SN Applied Sciences, 1(9), 1112. https://doi.org/10.1007/s42452-019-1117-9
  • Choudhary, D. (2021, April 18). Bootstrapping and OOB samples in Random Forests. Analytics Vidhya. https://medium.com/analytics-vidhya/bootstrapping-and-oob-samples-in-random-forests-6e083b6bc341
  • Choudhury, A., & Gupta, D. (2019). A Survey on Medical Diagnosis of Diabetes Using Machine Learning Techniques. In: J. Kalita, V. E. Balas, S. Borah, & R. Pradhan (Eds.), Recent Developments in Machine Learning and Data Analytics (Vol. 740, pp. 67-78). Springer Singapore. https://doi.org/10.1007/978-981-13-1280-9_6
  • Gandhi, R. (2018, May 17). Naive Bayes Classifier. Medium. https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
  • Harrison, G. (2022, February 28). A Deep Dive into Stacking Ensemble Machine Learning—Part I. Medium. https://towardsdatascience.com/a-deep-dive-into-stacking-ensemble-machine-learning-part-i-10476b2ade3
  • Ibrahim, I., & Abdulazeez, A. (2021). The Role of Machine Learning Algorithms for Diagnosing Diseases. Journal of Applied Science and Technology Trends, 2(01), 10-19. https://doi.org/10.38094/jastt20179
  • IDF (International Diabetes Federation) (2021). IDF Diabetes Atlas 10th ed.
  • Jakkula, V. (2010). Tutorial on Support Vector Machine (SVM).
  • Khanam, J. J., & Foo, S. Y. (2021). A comparison of machine learning algorithms for diabetes prediction. ICT Express, 7(4), 432-439. https://doi.org/10.1016/j.icte.2021.02.004
  • Lanhenke, M. (2022, May 1). Implementing Support Vector Machine From Scratch. Medium. https://towardsdatascience.com/implementing-svm-from-scratch-784e4ad0bc6a
  • Loeber, P. (2019a, September 29). Naive Bayes in Python—Machine Learning From Scratch 05—Python Tutorial—YouTube. https://www.youtube.com/watch?v=BqUmKsfSWho&list=PLqnslRFeH2Upcrywf-u2etjdxxkL8nl7E&index=5
  • Loeber, P. (2019b, November 22). Decision Tree in Python Part 2/2—Machine Learning From Scratch 09—Python Tutorial. https://www.youtube.com/watch?v=Bqi7EFFvNOg
  • Loeber, P. (2019c, November 27). Random Forest in Python—Machine Learning From Scratch 10—Python Tutorial. https://www.youtube.com/watch?v=Oq1cKjR8hNo
  • Maniruzzaman, Md., Rahman, Md. J., Ahammed, B., & Abedin, Md. M. (2020). Classification and prediction of diabetes disease using machine learning paradigm. Health Information Science and Systems, 8(1), 7. https://doi.org/10.1007/s13755-019-0095-z
  • Normalized Nerd (Director). (2021, January 13). Decision Tree Classification Clearly Explained! https://www.youtube.com/watch?v=ZVR2Way4nwQ
  • Pranto, B., Mehnaz, S., Mahid, E. B., Sadman, I. M., Rahman, A., & Momen, S. (2020). Evaluating machine learning methods for predicting diabetes among female patients in Bangladesh. Information, 11(8), 374. https://doi.org/10.3390/info11080374
  • Prasanna, S. (2019). Machine Learning with Python. 1, 167.
  • Punthakee, Z., Goldenberg, R., & Katz, P. (2018). Definition, Classification and Diagnosis of Diabetes, Prediabetes and Metabolic Syndrome. Canadian Journal of Diabetes, 42, S10-S15. https://doi.org/10.1016/j.jcjd.2017.10.003
  • Rokach, L. (2009). Pattern Classification Using Ensemble Methods (Illustrated edition, Vol. 75). World Scientific Publishing Company.
  • Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2019). Data mining for business analytics: Concepts, techniques and applications in Python (1st ed.). John Wiley & Sons.
  • Singh, H. (2021, March 30). Variants of Stacking | Types of Stacking—Advanced Ensemble Learning. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/03/advanced-ensemble-learning-technique-stacking-and-its-variants/
  • Sisodia, D., & Sisodia, D. S. (2018). Prediction of diabetes using classification algorithms. Procedia Computer Science, 132, 1578-1585.
  • Sruthi, E. R. (2021, June 17). Random Forest | Introduction to Random Forest Algorithm. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/
  • Swaminathan, S. (2019, January 18). Logistic Regression—Detailed Overview. Medium. https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
  • Tigga, N. P., & Garg, S. (2020). Prediction of type 2 diabetes using machine learning classification methods. Procedia Computer Science, 167, 706-716.
  • WHO (World Health Organization) (2021, November 10). Diabetes. https://www.who.int/news-room/fact-sheets/detail/diabetes
  • Zhang, C., & Ma, Y. (Eds.). (2012). Ensemble Machine Learning: Methods and Applications. Springer.
  • Zhu, C., Idemudia, C. U., & Feng, W. (2019). Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques. Informatics in Medicine Unlocked, 17, 100179.
  • Zimmet, P., Alberti, K. G., Magliano, D. J., & Bennett, P. H. (2016). Diabetes mellitus statistics on prevalence and mortality: Facts and fallacies. Nature Reviews Endocrinology, 12(10), 616-622.
There are 31 citations in total.

Details

Primary Language English
Subjects Machine Learning (Other)
Journal Section Information and Computing Sciences
Authors

Nathan Zoakah 0009-0000-3873-8471

Augustine Shey Nsang 0000-0002-6466-9032

Abel Ajibesin 0000-0001-6518-0231

Ayuba Zoakah 0000-0003-1856-7753

Publication Date September 30, 2024
Submission Date August 12, 2024
Acceptance Date September 16, 2024
Published in Issue Year 2024 Volume: 11 Issue: 3

Cite

APA Zoakah, N., Shey Nsang, A., Ajibesin, A., Zoakah, A. (2024). Comparative Performance Analysis of Selected Machine Learning Algorithms and the Stacking Ensemble Method for Prediction of the Type II Diabetes Disease. Gazi University Journal of Science Part A: Engineering and Innovation, 11(3), 622-646. https://doi.org/10.54287/gujsa.1531997