Research Article

A Study on Missing Data Problem in Random Forest

Volume: 42 Issue: 1, 1 January 2020

Abstract

Random Forest is an ensemble method that combines many trees built from bootstrap samples of the original data. It is used for both classification and regression and offers several advantages: high accuracy, a built-in estimate of the generalization error, identification of important variables and outliers, support for both supervised and unsupervised learning, and imputation of missing values with an algorithm based on the proximity matrix. In this study, we aimed to compare the proximity-based imputation method of Random Forest with k-nearest-neighbor imputation performed prior to fitting. To this end, simulation studies were performed for a classification problem under various scenarios, including different percentages of missing values, numbers of neighbors, and correlation structures between predictor variables. The results showed that the proximity-matrix-based imputation method should be used for highly correlated structures, whereas k-nearest-neighbor imputation should be preferred for low and medium correlation structures.
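
The two imputation strategies compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' simulation code: the function names `knn_impute` and `proximity_impute_column`, the toy data, and the hand-written proximity matrix are all hypothetical, and the distance rescaling in the k-nearest-neighbor step is one common choice rather than something the abstract specifies. The proximity step shows a single update for a continuous variable, assuming the proximities (the fraction of trees in which two rows share a terminal node) have already been computed from a fitted forest.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is Euclidean over the features observed in both rows, rescaled
    by the share of features that could be compared (an assumption of this
    sketch, so rows with different amounts of missingness stay comparable).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        total = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(total * len(a) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that actually observe feature j.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


def proximity_impute_column(column, proximity):
    """Fill None entries with a proximity-weighted mean of observed values.

    proximity[i][m] is assumed to be precomputed from a fitted forest.
    """
    filled = list(column)
    for i, value in enumerate(column):
        if value is not None:
            continue
        pairs = [
            (proximity[i][m], v)
            for m, v in enumerate(column)
            if m != i and v is not None
        ]
        weight = sum(w for w, _ in pairs)
        if weight > 0:
            filled[i] = sum(w * v for w, v in pairs) / weight
    return filled


# Toy data: the second row is missing its middle feature.
data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [5.0, 6.0, 7.0],
    [0.9, 2.2, 2.9],
]
completed = knn_impute(data, k=2)  # completed[1][1] is about 2.1

# Hypothetical proximity matrix for the same four rows.
proximity = [
    [1.0, 0.8, 0.0, 0.6],
    [0.8, 1.0, 0.0, 0.2],
    [0.0, 0.0, 1.0, 0.0],
    [0.6, 0.2, 0.0, 1.0],
]
filled = proximity_impute_column([row[1] for row in data], proximity)  # filled[1] is about 2.04
```

In the k-nearest-neighbor variant the data are completed before any forest is fitted, whereas the proximity variant needs a forest first, so in practice it starts from a rough fill and alternates between fitting and re-imputing.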

Details

Primary Language

English

Subjects

Health Care Administration

Section

Research Article

Publication Date

1 January 2020

Submission Date

13 December 2018

Acceptance Date

1 March 2019

Published in Issue

Year 2020 Volume: 42 Issue: 1

Cite

APA
Özen, H., & Bal, C. (2020). A Study on Missing Data Problem in Random Forest. Osmangazi Tıp Dergisi, 42(1), 103-109. https://doi.org/10.20515/otd.496524
AMA
1. Özen H, Bal C. A Study on Missing Data Problem in Random Forest. Osmangazi Tıp Dergisi. 2020;42(1):103-109. doi:10.20515/otd.496524
Chicago
Özen, Hülya, and Cengiz Bal. 2020. “A Study on Missing Data Problem in Random Forest”. Osmangazi Tıp Dergisi 42 (1): 103-9. https://doi.org/10.20515/otd.496524.
EndNote
Özen H, Bal C (01 January 2020) A Study on Missing Data Problem in Random Forest. Osmangazi Tıp Dergisi 42 1 103–109.
IEEE
[1] H. Özen and C. Bal, “A Study on Missing Data Problem in Random Forest”, Osmangazi Tıp Dergisi, vol. 42, no. 1, pp. 103–109, Jan. 2020, doi: 10.20515/otd.496524.
ISNAD
Özen, Hülya - Bal, Cengiz. “A Study on Missing Data Problem in Random Forest”. Osmangazi Tıp Dergisi 42/1 (01 January 2020): 103-109. https://doi.org/10.20515/otd.496524.
JAMA
1. Özen H, Bal C. A Study on Missing Data Problem in Random Forest. Osmangazi Tıp Dergisi. 2020;42:103–109.
MLA
Özen, Hülya, and Cengiz Bal. “A Study on Missing Data Problem in Random Forest”. Osmangazi Tıp Dergisi, vol. 42, no. 1, Jan. 2020, pp. 103-9, doi:10.20515/otd.496524.
Vancouver
1. Hülya Özen, Cengiz Bal. A Study on Missing Data Problem in Random Forest. Osmangazi Tıp Dergisi. 01 January 2020;42(1):103-9. doi:10.20515/otd.496524
