A Study on Missing Data Problem in Random Forest

Hülya Özen; Cengiz Bal

doi:10.20515/otd.496524

Research Article

Year 2020, Volume: 42 Issue: 1, 103 - 109, 01.01.2020

Cited By: 2

Abstract

Random Forest is an ensemble method that combines many trees constructed from
bootstrap samples of the original data. It is used for both classification and
regression and offers many advantages: high accuracy, an estimate of the
generalization error, identification of important variables and outliers,
support for supervised and unsupervised learning, and imputation of missing
values with an algorithm based on the proximity matrix. In this study, we aimed
to compare the proximity-based imputation method of Random Forest with
k-nearest-neighbor imputation applied prior to fitting. To that end, simulation
studies were performed for a classification problem under various scenarios,
including different percentages of missing values, numbers of neighbors, and
correlation structures between predictor variables. The results showed that the
proximity-matrix-based imputation method should be used for highly correlated
structures, whereas k-nearest-neighbor imputation should be preferred for
structures with low and medium correlation.
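
The two imputation strategies compared above can be illustrated with a minimal sketch. This is not the authors' code: the function names `knn_impute` and `proximity_impute_step` are hypothetical, missing values are assumed to be encoded as `None`, and the proximity matrix is supplied by hand rather than derived from a fitted forest (in a Random Forest, the proximity of two observations is usually taken as the fraction of trees in which they fall in the same terminal node). The proximity function performs a single weighted-average update for continuous variables; the full procedure typically starts from a rough fill and repeats the fit-and-update cycle several times.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows."""

    def dist(a, b):
        # Euclidean distance over features observed in both rows, rescaled
        # by the share of features used; infinite if nothing overlaps.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum(shared) * len(a) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which feature j is observed.
            donors = sorted(
                (dist(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def proximity_impute_step(rows, proximity):
    """One update: a missing x[i][j] becomes the proximity-weighted mean of
    feature j over the rows in which that feature is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            pairs = [(proximity[i][m], other[j])
                     for m, other in enumerate(rows)
                     if m != i and other[j] is not None]
            total = sum(w for w, _ in pairs)
            if total > 0:
                filled[i][j] = sum(w * v for w, v in pairs) / total
    return filled


# Toy data: the second observation is missing its second feature.
rows = [[1.0, 2.0], [1.1, None], [5.0, 6.0], [0.9, 1.8]]

# Hypothetical proximity matrix (symmetric, ones on the diagonal).
prox = [
    [1.0, 0.8, 0.0, 0.6],
    [0.8, 1.0, 0.1, 0.5],
    [0.0, 0.1, 1.0, 0.0],
    [0.6, 0.5, 0.0, 1.0],
]

knn_filled = knn_impute(rows, k=2)
prox_filled = proximity_impute_step(rows, prox)
print(knn_filled[1], prox_filled[1])
```

The kNN variant uses only the k closest donors and weights them equally, while the proximity variant lets every observation contribute in proportion to how often it shares a terminal node with the incomplete one. That difference is one plausible reason the two methods respond differently to the correlation structure of the predictors.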

Keywords

kNN imputation method, Missing value, Proximity matrix, Random Forests

Turkish title: Rasgele Orman Yönteminde Eksik Veri Probleminin İncelenmesi (An Investigation of the Missing Data Problem in the Random Forest Method)

There are 16 references in total.

Details

Primary Language	English
Subjects	Health Care Administration
Section	ORIGINAL ARTICLES
Authors	Hülya Özen 0000-0003-4144-3732 Cengiz Bal 0000-0002-1553-2902
Publication Date	January 1, 2020
Published in Issue	Year 2020 Volume: 42 Issue: 1