A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets

Hatice Nizam Özoğur; Zeynep Orman

doi:10.18466/cbayarfbe.1766229

Abstract

In healthcare datasets, imbalanced class distributions and missing data degrade the performance and stability of machine learning models, hindering accurate analysis and disease diagnosis. Addressing both problems is crucial for improving the precision and reliability of healthcare data analysis. This paper proposes a novel preprocessing framework, designed specifically for healthcare datasets, that mitigates incomplete data and class imbalance. The framework introduces a new imputation method, GA-MICE, which enhances the Multiple Imputation by Chained Equations (MICE) technique with a Genetic Algorithm (GA) to impute missing values more accurately. The framework also incorporates the GASMOTEPSO_ENN method, which combines the Synthetic Minority Oversampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms with GA and Particle Swarm Optimization (PSO) heuristics to address class imbalance. After preprocessing, six machine learning classifiers are used to categorize individuals as either patients or healthy subjects. Classifier performance is evaluated using accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), and Area Under the Curve (AUC). Experimental results demonstrate the effectiveness of the proposed approach in handling missing data and class imbalance, with performance close to or exceeding that of existing methods reported in the literature.
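The evaluation metrics named in the abstract can be illustrated with a minimal sketch. The code below is not taken from the paper; it assumes a binary patient/healthy task, takes confusion-matrix counts as input, and computes AUC with the rank-based (pairwise comparison) formulation. The function names `binary_metrics` and `auc_from_scores` are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Threshold-based metrics from 2x2 confusion-matrix counts.

    Hypothetical helper for illustration; undefined ratios fall back to 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up counts and scores, used only to exercise the functions.
    print(binary_metrics(tp=40, fp=10, fn=5, tn=45))
    print(auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; a sort-based computation would be preferable on larger datasets.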

Ethical Statement

The publication of this manuscript raises no ethical issues.


Details

Primary Language

English

Subjects

Biomedical Diagnosis

Journal Section

Research Article

Authors

Hatice Nizam Özoğur*
0000-0002-9722-4355
Türkiye

Zeynep Orman
0000-0002-0205-4198
Türkiye

Publication Date

June 30, 2026

Submission Date

August 16, 2025

Acceptance Date

January 26, 2026

Published in Issue

Year 2026 Volume: 22 Number: 2

DOI

https://doi.org/10.18466/cbayarfbe.1766229

IZ

https://izlik.org/JA26BW78MX

Cite

APA

Nizam Özoğur, H., & Orman, Z. (2026). A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets. Celal Bayar University Journal of Science, 22(2), 225-235. https://doi.org/10.18466/cbayarfbe.1766229

AMA

1. Nizam Özoğur H, Orman Z. A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets. CBUJOS. 2026;22(2):225-235. doi:10.18466/cbayarfbe.1766229

Chicago

Nizam Özoğur, Hatice, and Zeynep Orman. 2026. “A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets”. Celal Bayar University Journal of Science 22 (2): 225-35. https://doi.org/10.18466/cbayarfbe.1766229.

EndNote

Nizam Özoğur H, Orman Z (June 1, 2026) A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets. Celal Bayar University Journal of Science 22 2 225–235.

IEEE

[1] H. Nizam Özoğur and Z. Orman, “A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets”, CBUJOS, vol. 22, no. 2, pp. 225–235, June 2026, doi: 10.18466/cbayarfbe.1766229.

ISNAD

Nizam Özoğur, Hatice - Orman, Zeynep. “A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets”. Celal Bayar University Journal of Science 22/2 (June 1, 2026): 225-235. https://doi.org/10.18466/cbayarfbe.1766229.

JAMA

1. Nizam Özoğur H, Orman Z. A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets. CBUJOS. 2026;22:225–235.

MLA

Nizam Özoğur, Hatice, and Zeynep Orman. “A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets”. Celal Bayar University Journal of Science, vol. 22, no. 2, June 2026, pp. 225-35, doi:10.18466/cbayarfbe.1766229.

Vancouver

1. Hatice Nizam Özoğur, Zeynep Orman. A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets. CBUJOS. 2026 Jun. 1;22(2):225-35. doi:10.18466/cbayarfbe.1766229