Research Article

Leveraging SHAP for Interpretable Diabetes Prediction: A Study of Machine Learning Models on the Pima Indians Diabetes Dataset

Year 2025, Volume: 13 Issue: 2, 128 - 139
https://doi.org/10.17694/bajece.1577929

Abstract

This paper investigates the application of machine learning (ML) models for predicting diabetes using the Pima Indians Diabetes Database, with a focus on enhancing model interpretability through SHapley Additive exPlanations (SHAP). The study evaluates eight ML models: Adaptive Boosting (AdaBoost), k-Nearest Neighbors (k-NN), Logistic Regression (LR), Multi-layer Perceptron (MLP), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and eXtreme Gradient Boosting (XGBoost). Each model is assessed with both a train/test split and 10-fold cross-validation. The RF model performed best, achieving an accuracy of 82% and an F1-score of 0.83 on the train/test split, and an accuracy of 83% and an F1-score of 0.84 under 10-fold cross-validation. SHAP analysis was employed to identify the most influential predictors, revealing that glucose, BMI, pregnancies, and insulin levels are the key factors in diabetes prediction, in line with established clinical markers. Additionally, class balancing with the Synthetic Minority Over-sampling TEchnique (SMOTE) and data scaling contribute to robust model performance. The study emphasizes the necessity for interpretable ML in healthcare, proposing SHAP as a valuable tool for bridging predictive accuracy and clinical transparency in diabetes diagnostics.
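To make the idea behind the SHAP attributions concrete, the following is a minimal sketch of an exact Shapley value computation for a single instance. It is not the authors' pipeline: the `toy_risk` function, its coefficients, the patient values, and the single-point `baseline` are all hypothetical and chosen only for illustration. The SHAP library typically averages over a background dataset rather than a single baseline and uses faster model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one instance.

    Features outside a coalition are set to their baseline value
    (an assumption of this sketch; SHAP usually averages over data).
    """
    names = list(x)
    n = len(names)
    phi = {}
    for i in names:
        others = [f for f in names if f != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|!(n-|S|-1)!/n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = {
                    f: (x[f] if f in coalition else baseline[f]) for f in names
                }
                with_i = dict(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition
                total += weight * (predict(with_i) - predict(without_i))
        phi[i] = total
    return phi


def toy_risk(p):
    """Hypothetical risk score, NOT the paper's Random Forest."""
    score = 0.02 * p["glucose"] + 0.05 * p["bmi"] + 0.1 * p["pregnancies"]
    if p["glucose"] > 140 and p["bmi"] > 30:
        score += 1.0  # illustrative interaction between glucose and BMI
    return score


# Hypothetical patient and reference point (illustrative values only)
patient = {"glucose": 160.0, "bmi": 35.0, "pregnancies": 3.0}
baseline = {"glucose": 110.0, "bmi": 28.0, "pregnancies": 1.0}

phi = shapley_values(toy_risk, patient, baseline)
for name, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the prediction for the patient and the prediction at the baseline, which is the additivity property that lets SHAP values be read as per-feature contributions. The exact enumeration grows exponentially with the number of features, so it is practical only for small feature sets such as this one.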

References

  • [1] P. David, S. Singh, R. Ankar, “A comprehensive overview of skin complications in diabetes and their prevention,” Cureus, vol. 15, no. 5, p. e38961, 2023.
  • [2] A. F. Walker et al., “Interventions to address global inequity in diabetes: international progress,” Lancet, vol. 402, no. 10397, 2023, pp. 250-264.
  • [3] M. Zakir et al., “Cardiovascular complications of diabetes: From microvascular to macrovascular pathways,” Cureus, vol. 15, no. 9, p. e45835, 2023.
  • [4] A. Avogaro, M. Rigato, E. di Brino, D. Bianco, I. Gianotto, G. Brusaporco, “The socio-environmental determinants of diabetes and their consequences,” Acta Diabetol., vol. 61, no. 10, 2024, pp. 1205-1210.
  • [5] S. Gowthami, R. Venkata Siva Reddy, M. R. Ahmed, “Exploring the effectiveness of machine learning algorithms for early detection of Type-2 Diabetes Mellitus,” Measur. Sens., vol. 31, no. 100983, p. 100983, 2024.
  • [6] A. A. L. Ahmad, A. A. Mohamed, “Artificial intelligence and machine learning techniques in the diagnosis of type I diabetes: Case studies,” in Studies in Computational Intelligence, Singapore: Springer Nature Singapore, 2024, pp. 289-302.
  • [7] T. Althobaiti, S. Althobaiti, M. M. Selim, “An optimized diabetes mellitus detection model for improved prediction of accuracy and clinical decision-making,” Alex. Eng. J., vol. 94, 2024, pp. 311-324.
  • [8] R. F. Albadri, S. M. Awad, A. S. Hameed, T. H. Mandeel, R. A. Jabbar, “A diabetes prediction model using hybrid machine learning algorithm,” Math. Model. Eng. Probl., vol. 11, no. 8, 2024, pp. 2119-2126.
  • [9] S. Buyrukoğlu, A. Akbaş, “Machine learning based early prediction of type 2 diabetes: A new hybrid feature selection approach using Correlation Matrix with Heatmap and SFS,” Balkan Journal of Electrical and Computer Engineering, vol. 10, no. 2, 2022, pp. 110-117.
  • [10] A. Adadi, M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, 2018, pp. 52138-52160.
  • [11] F. Bodria, F. Giannotti, R. Guidotti, F. Naretto, D. Pedreschi, S. Rinzivillo, “Benchmarking and survey of explanation methods for black box models,” Data Min. Knowl. Discov., vol. 37, no. 5, 2023, pp. 1719-1778.
  • [12] A. Barredo Arrieta et al., “Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI,” Inf. Fusion, vol. 58, 2020, pp. 82-115.
  • [13] W. Ding, M. Abdel-Basset, H. Hawash, A. M. Ali, “Explainability of artificial intelligence methods, applications and challenges: A comprehensive survey,” Inf. Sci. (Ny), vol. 615, 2022, pp. 238-292.
  • [14] V. Hassija et al., “Interpreting black-box models: A review on explainable Artificial Intelligence,” Cognit. Comput., vol. 16, no. 1, 2024, pp. 45-74.
  • [15] S. Lundberg, S.-I. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan et al., Eds. Curran Associates, Inc., 2017.
  • [16] L. S. Shapley, “Stochastic games,” Proc. Natl. Acad. Sci. U.S.A., vol. 39, no. 10, 1953, pp. 1095-1100.
  • [17] K. Aliyeva, N. Mehdiyev, “Uncertainty-aware multi-criteria decision analysis for evaluation of explainable artificial intelligence methods: A use case from the healthcare domain,” Information Sciences, vol. 657, no. 119987, p. 119987, 2024.
  • [18] Kaggle Dataset, “Pima Indian Diabetes Database,” 2017. [Online]. Available: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
  • [19] P. Verma, A. Khatoon, “Data mining applications in healthcare: A comparative analysis of classification techniques for diabetes diagnosis using the PIMA Indian diabetes dataset,” in 2024 4th International Conference on Innovative Practices in Technology and Management (ICIPTM), 2024.
  • [20] L. Xie, “Pima Indian diabetes database and machine learning models for diabetes prediction,” Highlights in Science, Engineering and Technology, vol. 88, 2024, pp. 97-103.
  • [21] V. Chang, J. Bailey, Q. A. Xu, Z. Sun, “Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms,” Neural Comput. Appl., vol. 35, no. 22, 2022, pp. 1-17.
  • [22] S. Sahoo, T. Mitra, A. K. Mohanty, B. J. R. Sahoo, S. Rath, “Diabetes prediction: A study of various classification based data mining techniques,” International Journal of Computer Science and Informatics, vol. 4, no. 3, 2022, pp. 1-13.
  • [23] S. You, M. Kang, “A Study on Methods to Prevent Pima Indians Diabetes using SVM,” Korean Journal of Artificial Intelligence, vol. 8, no. 2, 2020, pp. 7-10.
  • [24] A. F. Ashour, M. M. Fouda, Z. M. Fadlullah, M. I. Ibrahem, “Optimized neural networks for diabetes classification using Pima Indians diabetes database,” in 2024 IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI), 2024.
  • [25] K. Akyol, B. Şen, “Diabetes mellitus data classification by cascading of feature selection methods and ensemble learning algorithms,” Int. J. Mod. Educ. Comput. Sci., vol. 6, 2018, pp. 10-16.
  • [26] M. S. Reza, R. Amin, R. Yasmin, W. Kulsum, S. Ruhi, “Improving diabetes disease patients classification using stacking ensemble method with PIMA and local healthcare data,” Heliyon, vol. 10, p. e24536, 2024.
  • [27] A. Pyne, B. Chakraborty, “Artificial Neural Network based approach to Diabetes Prediction using Pima Indians Diabetes Dataset,” in 2023 International Conference on Control, Automation and Diagnosis (ICCAD), Rome, Italy, 2023.
  • [28] A V. Jain, S. Shukla, N. Khare, “Analysis of various data imputation techniques for diabetes classification on PIMA dataset,” in 2024 IEEE International Students’ Conference on Electrical, Electronics and Computer Science (SCEECS), 2024, pp. 1–6.
  • [29] S. Karatsiolis, C. N. Schizas, “Region based Support Vector Machine algorithm for medical diagnosis on Pima Indian Diabetes dataset,” in 2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE), 2012.
  • [30] M. Bilal, G. Ali, M. W. Iqbal, M. Anwar, M. S. A. Malik, R. A. Kadir, “Auto-Prep: Efficient and Automated Data Preprocessing Pipeline,” IEEE Access, vol. 10, 2022, pp. 107764-107784.
  • [31] L. B. V. de Amorim, G. D. C. Cavalcanti, R. M. O. Cruz, “The choice of scaling technique matters for classification performance,” Appl. Soft Comput., vol. 133, no. 109924, p. 109924, 2023.
  • [32] A. D. Amirruddin, F. M. Muharam, M. H. Ismail, N. P. Tan, M. F. Ismail, “Synthetic Minority Over-sampling TEchnique (SMOTE) and Logistic Model Tree (LMT)-Adaptive Boosting algorithms for classifying imbalanced datasets of nutrient and chlorophyll sufficiency levels of oil palm (Elaeis guineensis) using spectroradiometers and unmanned aerial vehicles,” Comput. Electron. Agric., vol. 193, no. 106646, p. 106646, 2022.
  • [33] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, 2002, pp. 321-357.
  • [34] T. Yilmaz, “Microwave spectroscopy based classification of rat hepatic tissues: On the significance of dataset,” Balkan Journal of Electrical and Computer Engineering, vol. 8, no. 4, 2020, pp. 307-313.
  • [35] T. Tulgar, A. Haydar, İ. Erşan, “A distributed K Nearest Neighbor classifier for Big Data,” Balkan Journal of Electrical and Computer Engineering, vol. 6, no. 2, 2018, pp. 105-111.
  • [36] T. Pala, A. Y. Camurcu, “Design of decision support system in the metastatic colorectal cancer data set and its application,” Balkan Journal of Electrical and Computer Engineering, vol. 4, no. 1, 2016, pp. 12-16.
  • [37] C. Greco, P. Pace, S. Basagni, G. Fortino, “Jamming detection at the edge of drone networks using Multi-layer Perceptrons and Decision Trees,” Appl. Soft Comput., vol. 111, no. 107806, p. 107806, 2021.
  • [38] İ. Kırbaş, A. Çifci, “Machine Learning-Based Rice Grain Classification Through Numerical Feature Extraction from Rice Image Data,” in 9th International Zeugma Conference on Scientific Research, Gaziantep, Türkiye, 2023.
  • [39] A. Çifci, M. İlkuçar, “Analysis of window sizes in prediction of daily COVID-19 cases using machine learning models,” International Journal of Mechatronics, Electrical and Computer Technology (IJMEC), vol. 12, no. 45, 2022, pp. 5208-5217.
  • [40] G. Bilgin, A. Çifci, “Eritematöz skuamöz hastalıkların teşhisinde makine öğrenme algoritmaları performanslarının değerlendirilmesi,” Journal of Intelligent Systems: Theory and Applications, vol. 4, no. 2, 2021, pp. 195-202.
  • [41] C. Bentéjac, A. Csörgő, G. Martínez-Muñoz, “A comparative analysis of gradient boosting algorithms,” Artif. Intell. Rev., vol. 54, no. 3, 2021, pp. 1937-1967.
  • [42] C. Molnar, Interpretable machine learning: a guide for making black box models interpretable. Morrisville, North Carolina: Lulu, 2019.
  • [43] S. M. Lundberg et al., “From local explanations to global understanding with explainable AI for trees,” Nat. Mach. Intell., vol. 2, no. 1, 2020, pp. 56-67.
  • [44] S. M. Lundberg, G. G. Erion, S.-I. Lee, “Consistent individualized feature attribution for tree ensembles,” arXiv [cs.LG], 2018.
There are 42 citations in total.

Details

Primary Language English
Subjects Computer Software
Journal Section Research Article
Authors

İsmail Kırbaş 0000-0002-1206-8294

Ahmet Çifci 0000-0001-7679-9945

Early Pub Date July 11, 2025
Publication Date
Submission Date November 1, 2024
Acceptance Date December 27, 2024
Published in Issue Year 2025 Volume: 13 Issue: 2

Cite

APA Kırbaş, İ., & Çifci, A. (2025). Leveraging SHAP for Interpretable Diabetes Prediction: A Study of Machine Learning Models on the Pima Indians Diabetes Dataset. Balkan Journal of Electrical and Computer Engineering, 13(2), 128-139. https://doi.org/10.17694/bajece.1577929

All articles published by BAJECE are licensed under the Creative Commons Attribution 4.0 International License. This permits anyone to copy, redistribute, remix, transmit and adapt the work provided the original work and source are appropriately cited.