A Study on Missing Data Problem in Random Forest
Abstract
Random Forest is an ensemble method that combines many trees constructed from bootstrap samples of the original data. It is used for both classification and regression and offers several advantages: high accuracy, an estimate of the generalization error, identification of important variables and outliers, support for supervised and unsupervised learning, and imputation of missing values with an algorithm based on the proximity matrix. In this study, we compared the proximity-based imputation method of Random Forest with k-nearest-neighbor imputation performed prior to fitting. Simulation studies were carried out for a classification problem under various scenarios, covering different percentages of missing values, numbers of neighbors, and correlation structures between predictor variables. The results showed that the proximity-matrix-based imputation method should be used for highly correlated structures, whereas k-nearest-neighbor imputation should be preferred for structures with low and medium correlation.
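As a rough illustration of the k-nearest-neighbor imputation step that the abstract compares against, the following is a minimal sketch in plain Python. It is not the study's implementation: the function name `knn_impute`, the use of `None` to mark missing values, the distance (Euclidean over the columns observed in both rows, rescaled by the number of shared columns) and the use of the neighbors' mean as the fill value are all assumptions made for this example. The proximity-based Random Forest method is not shown.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that column over the k nearest rows.

    Hypothetical helper for illustration only. Distances are computed from
    the original (un-imputed) rows, so the result does not depend on the
    order in which missing cells are visited.
    """
    def distance(a, b):
        # Compare only the columns that both rows actually observe.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # no common columns: rows are not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observe column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if math.isfinite(d[0])]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the cell missing if no donor is usable
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Small usage example: the second row is missing its second feature.
data = [[1.0, 2.0], [1.1, None], [5.0, 10.0], [1.2, 2.2]]
filled = knn_impute(data, k=2)
```

In this toy example the two rows closest to the incomplete row on the first feature supply the fill value, so the distant row `[5.0, 10.0]` has no influence. The number of neighbors `k` corresponds to one of the factors varied in the simulation scenarios described above.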
Details
Primary Language
English
Subjects
Health Care Administration
Section
Research Article
Publication Date
January 1, 2020
Submission Date
December 13, 2018
Acceptance Date
March 1, 2019
Published in Issue
Year 2020 Volume: 42 Issue: 1
Cited By
Integration of CNN-Based Deep Learning and Machine Learning Techniques: A New Methodology for Employee Turnover Prediction
International Journal of Management Economics and Business
https://doi.org/10.17130/ijmeb.1529822
Simulation comparison of the effects of missing data imputation methods on classification performance in high dimensional data
Communications in Statistics - Simulation and Computation
https://doi.org/10.1080/03610918.2025.2524548