Research Article

A Comparison of Five Methods for Missing Value Imputation in Data Sets

Year 2018, Volume: 2 Issue: 2, 80 - 85, 31.12.2018

Abstract

Missing values in data sets prevent accurate analysis. The correct imputation of missing values has therefore attracted considerable attention from researchers in recent years. This paper compares the most reliable and up-to-date estimation methods for imputing missing values. Imputation has a very high priority because of its impact on subsequent pre-processing, data analysis, classification, clustering, and similar tasks. Root mean square error (RMSE), classification accuracy, and execution time are used to evaluate the performance of the five most popular methods (mean, k-nearest neighbors, singular value decomposition, Bayesian principal component analysis, and missForest). When the RMSE and classification accuracy values of the methods were compared, missForest was observed to outperform the other methods on all data sets.
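The RMSE-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`rmse`, `mean_impute`, `knn_impute`, `demo`), the synthetic data, the choice of `k`, and the number of masked cells are all assumptions made for illustration, and only two of the five methods (mean and k-nearest neighbors) are covered. Known values are hidden, each method fills them in, and RMSE is computed between the hidden and the imputed values.

```python
import math
import random


def rmse(true_vals, imputed_vals):
    """Root mean square error between held-out true values and imputed values."""
    squared = [(t - p) ** 2 for t, p in zip(true_vals, imputed_vals)]
    return math.sqrt(sum(squared) / len(squared))


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=3):
    """Replace each None with the column average over the k nearest donor rows.

    Distance is the mean squared difference over columns observed in both rows
    (an assumption of this sketch; other distance definitions are possible).
    """
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return sum(shared) / len(shared) if shared else math.inf

    filled = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                # Donors are other rows that have column j observed.
                donors = [(dist(row, other), other[j])
                          for m, other in enumerate(rows)
                          if m != i and other[j] is not None]
                donors.sort(key=lambda d: d[0])
                nearest = [val for _, val in donors[:k]]
                new_row[j] = sum(nearest) / len(nearest)
        filled.append(new_row)
    return filled


def demo(n_rows=60, n_masked=18, seed=0):
    """Mask known cells in synthetic correlated data and score both methods."""
    rng = random.Random(seed)
    complete = []
    for _ in range(n_rows):
        x = rng.gauss(0, 1)
        complete.append([x, 2 * x + rng.gauss(0, 0.1), -x + rng.gauss(0, 0.1)])

    # Hide a fixed number of randomly chosen cells.
    all_cells = [(i, j) for i in range(n_rows) for j in range(3)]
    masked = rng.sample(all_cells, n_masked)
    incomplete = [list(r) for r in complete]
    for i, j in masked:
        incomplete[i][j] = None

    truth = [complete[i][j] for i, j in masked]
    results = {}
    for name, method in (("mean", mean_impute), ("knn", knn_impute)):
        imputed = method(incomplete)
        results[name] = rmse(truth, [imputed[i][j] for i, j in masked])
    return results


if __name__ == "__main__":
    for name, score in demo().items():
        print(f"{name}: RMSE = {score:.4f}")
```

A lower RMSE means the imputed values are closer to the hidden true values. Classification accuracy and execution time, the other two criteria named in the abstract, would need a downstream classifier and a timer around each method and are left out of this sketch.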



Details

Primary Language Turkish
Subjects Computer Software
Section Articles
Authors

Pınar Cihan 0000-0001-7958-7251

Publication Date December 31, 2018
Acceptance Date December 12, 2018
Published in Issue Year 2018 Volume: 2 Issue: 2

Cite

APA Cihan, P. (2018). A Comparison of Five Methods for Missing Value Imputation in Data Sets. International Scientific and Vocational Studies Journal, 2(2), 80-85.


This work is licensed under a Creative Commons Attribution 4.0 International License.