Research Article

A Study on Missing Data Problem in Random Forest

Year 2020, Volume: 42 Issue: 1, 103 - 109, 01.01.2020
https://doi.org/10.20515/otd.496524

Abstract

Random Forest is an ensemble method that combines many decision trees, each
constructed from a bootstrap sample of the original data. It is used for both
classification and regression and offers many advantages: high accuracy, an
estimate of the generalization error, identification of important variables
and outliers, support for both supervised and unsupervised learning, and
imputation of missing values with an algorithm based on the proximity matrix.
In this study, we aimed to compare the proximity-based imputation method of
Random Forest with k-nearest-neighbor imputation performed prior to model
fitting. To this end, simulation studies were performed for a classification
problem under various scenarios, including different percentages of missing
values, numbers of neighbors, and correlation structures between predictor
variables. The results showed that the proximity-matrix-based imputation
method should be used for highly correlated structures, whereas the
k-nearest-neighbor imputation method should be preferred for low and medium
correlation structures.
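
To make the k-nearest-neighbor step concrete, the following is a minimal sketch of imputation performed before any model is fitted. It is not the authors' implementation. The function name `knn_impute`, the toy data, the Euclidean distance over jointly observed features, and the unweighted mean of the neighbors' values are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Assumption (not taken from the study): distance is Euclidean over the
    features observed in both rows, averaged over the number of shared features.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlapping features: unusable as a neighbor
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]  # copy, so the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical toy data: the second row is missing its middle feature.
data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [0.9, 1.8, 2.9],
    [5.0, 6.0, 7.0],
]
filled = knn_impute(data, k=2)
print(filled[1])  # middle value is the mean of the two closest rows' values
```

In a workflow like the one compared in the study, the completed matrix would then be passed to a Random Forest classifier. The proximity-based alternative instead refines the imputed values using proximities obtained from the fitted forest.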

Rasgele Orman Yönteminde Eksik Veri Probleminin İncelenmesi


Abstract

Random Forest is an ensemble method that brings together many decision trees
built from bootstrap samples of the original data. Random Forest is used for
both classification and regression and provides many advantages, such as
achieving high accuracy, calculating the generalization error, identifying
important variables and outliers, performing supervised and unsupervised
learning, and imputing values for missing observations with an algorithm based
on the proximity matrix. In this study, we aimed to compare the
proximity-matrix-based imputation method of Random Forest with
nearest-neighbor imputation applied before model fitting. For this purpose, a
simulation study was carried out for a classification problem under various
scenarios, including different percentages of missing values, numbers of
neighbors, and correlation structures between predictor variables. The results
show that the proximity-matrix-based imputation method should be used for
highly correlated structures, whereas nearest-neighbor imputation should be
preferred for medium and low correlation structures.

References

  • Breiman L. Random forests. Mach Learn. 2001;45(1):5-32.
  • Cutler A, Cutler D, Stevens J, Zhang C, Ma Y. Ensemble Machine Learning: Methods and Applications. Springer Science+Business Media, LLC; 2012.
  • Qi Y. Random forest for bioinformatics. Ensemble machine learning: Springer; 2012. p. 307-23.
  • Moorthy K, Mohamad MS, Deris S, editors. Multiclass Prediction for Cancer Microarray Data Using Various Variables Range Selection Based on Random Forest. Pacific-Asia Conference on Knowledge Discovery and Data Mining; 2013: Springer.
  • Wu X, Wu Z, Li K, editors. Classification and identification of differential gene expression for microarray data: improvement of the random forest method. 2008 2nd International Conference on Bioinformatics and Biomedical Engineering; 2008: IEEE.
  • Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, Gibson J, et al. Random forests for classification in ecology. Ecology. 2007;88(11):2783-92.
  • Pantanowitz A, Marwala T. Evaluating the Impact of Missing Data Imputation through the use of the Random Forest Algorithm. arXiv preprint arXiv:0812.2412. 2008.
  • Soley-Bori M. Dealing with missing data: Key assumptions and methods for applied analysis. Boston University. 2013.
  • Takahashi M, Ito T. Multiple Imputation of Turnover in EDINET Data: Toward the Improvement of Imputation for the Economic Census. Work Session on Statistical Data Editing, UNECE. 2012:24-6.
  • Scheel I, Aldrin M, Glad IK, Sorum R, Lyng H, Frigessi A. The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics. 2005;21(23):4272-9.
  • Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, et al. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001;17(6):520-5.
  • Rieger A, Hothorn T, Strobl C. Random forests with missing values in the covariates. 2010.
  • Breiman L, Cutler A. RFtools—for predicting and understanding data. Berkeley, CA, USA, Tech Rep. 2004.
  • Acuna E, Rodriguez C. The treatment of missing values and its effect on classifier accuracy. Classification, clustering, and data mining applications: Springer; 2004. p. 639-47.
  • Lee M-LT. Analysis of microarray gene expression data: Springer Science & Business Media; 2007.
  • Ouyang M, Welsh WJ, Georgopoulos P. Gaussian mixture clustering and imputation of microarray data. Bioinformatics. 2004;20(6):917-23.
There are 16 citations in total.

Details

Primary Language English
Subjects Health Care Administration
Journal Section ORIGINAL ARTICLE
Authors

Hülya Özen 0000-0003-4144-3732

Cengiz Bal 0000-0002-1553-2902

Publication Date January 1, 2020
Published in Issue Year 2020 Volume: 42 Issue: 1

Cite

Vancouver Özen H, Bal C. A Study on Missing Data Problem in Random Forest. Osmangazi Tıp Dergisi. 2020;42(1):103-9.

