A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets
Abstract
In healthcare datasets, imbalanced class distributions and missing data degrade the performance and stability of machine learning models, hindering accurate analysis and disease diagnosis. Addressing both problems is crucial for improving the precision and reliability of healthcare data analysis. This paper proposes a novel preprocessing framework designed specifically for healthcare datasets to mitigate incomplete data and class imbalance. The framework introduces a new imputation method, GA-MICE, which enhances the Multiple Imputation by Chained Equations (MICE) technique with a Genetic Algorithm (GA) to improve the accuracy of missing-value imputation. The framework also incorporates the GASMOTEPSO_ENN method, which combines the Synthetic Minority Oversampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms with GA and Particle Swarm Optimization (PSO) heuristics to address class imbalance. After preprocessing, six machine learning classifiers are used to categorize individuals as either patients or healthy subjects. Classifier performance is evaluated using multiple metrics: accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), and Area Under the Curve (AUC). Experimental results demonstrate the effectiveness of the proposed approach in managing missing data and class imbalance, achieving performance close to or exceeding that of existing methods reported in the literature.
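The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation code. It assumes binary labels (1 = patient, 0 = healthy subject), and the function names `binary_metrics` and `auc`, as well as the toy labels and scores, are hypothetical and chosen only for illustration. AUC is computed here as the fraction of patient/healthy pairs in which the patient receives the higher score, with ties counted as half.

```python
import math


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (assumed: 1 = patient, 0 = healthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


def auc(y_true, scores):
    """Pairwise (rank-based) AUC: share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")  # undefined when one class is absent
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up example: 3 patients and 7 healthy subjects (imbalanced).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.3, 0.35, 0.05, 0.15, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(binary_metrics(y_true, y_pred))
print("auc:", auc(y_true, scores))
```

On this toy data, accuracy is 0.8 while precision, recall, and F1 are each 2/3, which shows why accuracy alone can overstate performance on imbalanced data and why reporting MCC and AUC alongside it is useful.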
Details
Primary Language
English
Subjects
Biomedical Diagnosis
Section
Research Article
Publication Date
June 30, 2026
Submission Date
August 16, 2025
Acceptance Date
January 26, 2026
Published in Issue
Year 2026 Volume: 22 Issue: 2