A Study on Missing Data Problem in Random Forest

Hülya Özen; Cengiz Bal

doi:10.20515/otd.496524

Abstract

Random Forest is an ensemble method that combines many trees constructed from bootstrap samples of the original data. It is used for both classification and regression and offers several advantages: high accuracy, an estimate of the generalization error, identification of important variables and outliers, support for both supervised and unsupervised learning, and imputation of missing values with an algorithm based on the proximity matrix. In this study, we aimed to compare the proximity-based imputation method of Random Forest with k-nearest neighbor imputation performed prior to fitting. To this end, simulation studies were performed for a classification problem under various scenarios, including different percentages of missing values, numbers of neighbors, and correlation structures between predictor variables. The results showed that the proximity matrix based imputation method should be used for highly correlated structures, whereas k-nearest neighbor imputation should be preferred for low and medium correlated structures.
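
The two imputation strategies compared in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `knn_impute` and `proximity_impute`, the toy data, and the proximity matrix are hypothetical and chosen only for illustration. In particular, the proximity matrix is assumed to be supplied by an already fitted forest (the fraction of trees in which two observations share a terminal node), and only a single update step for continuous predictors is shown, whereas the full procedure would typically start from a rough fill and iterate.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest donors' values."""

    def distance(a, b):
        # Euclidean distance over features observed in both rows,
        # rescaled by the share of features that could be compared.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum(shared) * len(a) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


def proximity_impute(rows, proximity):
    """One proximity-weighted update for continuous features.

    proximity[i][m] is assumed to come from a fitted forest
    (hypothetical input here; no forest is trained in this sketch).
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Weight every observed value of feature j by its proximity to row i.
            pairs = [(proximity[i][m], other[j])
                     for m, other in enumerate(rows)
                     if m != i and other[j] is not None]
            total = sum(w for w, _ in pairs)
            if total > 0:
                imputed[i][j] = sum(w * v for w, v in pairs) / total
    return imputed


# Toy data: the second observation is missing its second predictor.
rows = [[1.0, 2.0], [1.1, None], [5.0, 10.0], [0.9, 1.8]]
# Hypothetical proximity matrix (symmetric, ones on the diagonal).
proximity = [
    [1.0, 0.8, 0.0, 0.6],
    [0.8, 1.0, 0.2, 0.6],
    [0.0, 0.2, 1.0, 0.1],
    [0.6, 0.6, 0.1, 1.0],
]

print(knn_impute(rows, k=2))
print(proximity_impute(rows, proximity))
```

The k-nearest neighbor variant uses only the k closest donors and ignores the rest, while the proximity-based variant lets every observation contribute in proportion to how often it falls in the same terminal node as the incomplete case.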

Keywords

kNN imputation method, missing value, proximity matrix, Random Forests

Details

Primary Language

English

Subjects

Health Care Administration

Section

Research Article

Authors

Hülya Özen *
0000-0003-4144-3732
Türkiye

Cengiz Bal
0000-0002-1553-2902
Türkiye

Publication Date

January 1, 2020

Submission Date

December 13, 2018

Acceptance Date

March 1, 2019

Published in Issue

Year 2020 Volume: 42 Issue: 1

DOI

https://doi.org/10.20515/otd.496524

IZ

https://izlik.org/JA65RT95TB

Cite

APA

Özen, H., & Bal, C. (2020). A Study on Missing Data Problem in Random Forest. Osmangazi Tıp Dergisi, 42(1), 103-109. https://doi.org/10.20515/otd.496524
