A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets
Abstract
In healthcare datasets, imbalanced class distributions and missing data degrade the performance and stability of machine learning models, which hinders accurate analysis and disease diagnosis. Addressing both problems is crucial for improving the precision and reliability of healthcare data analysis. This paper proposes a preprocessing framework designed specifically for healthcare datasets that targets incomplete data and class imbalance. The framework introduces a new imputation method, GA-MICE, which enhances the Multiple Imputation by Chained Equations (MICE) technique with a Genetic Algorithm (GA) to improve the accuracy of missing value imputation. The framework also incorporates the GASMOTEPSO_ENN method, which combines the Synthetic Minority Oversampling Technique (SMOTE) and the Edited Nearest Neighbor (ENN) algorithm with GA and Particle Swarm Optimization (PSO) heuristics to address class imbalance. After preprocessing, six machine learning classifiers are used to categorize individuals as either patients or healthy subjects. Classifier performance is evaluated with multiple metrics: accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), and Area Under the Curve (AUC). Experimental results demonstrate the effectiveness of the proposed approach in handling missing data and class imbalance, with performance close to or exceeding that of existing methods reported in the literature.
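The evaluation metrics named in the abstract can all be derived from a binary confusion matrix and a set of classifier scores. The following is a minimal Python sketch for illustration only, not the authors' code: the function names, the label encoding (1 = patient, 0 = healthy), and the toy labels and scores are assumptions introduced here.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), assuming 1 = patient and 0 = healthy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1-score, and MCC from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (pairwise formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy, made-up data: 3 patients and 5 healthy subjects.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based computation would be preferable on larger datasets.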
Details
Primary Language: English
Subjects: Biomedical Diagnosis
Journal Section: Research Article
Publication Date: June 30, 2026
Submission Date: August 16, 2025
Acceptance Date: January 26, 2026
Published in Issue: Year 2026, Volume 22, Number 2