A Proposal Method for Missing Value Analysis: Cluster Analysis Approach

Uğur Arcagök; Çiğdem Arıcıgil Çilan

doi:10.17093/alphanumeric.970448

Research Article

Year 2021, Volume 9, Issue 2, 299-310, 31.12.2021

Abstract

Imputing missing values is a problem frequently encountered in machine learning and data mining, and one that researchers must address. Many computer-based analysis algorithms operate under the assumption that the data contain no missing values, so insufficient examination of missing values can degrade the quality of analysis results. This study uses a data set of 52 variables compiled to measure the corporate sustainability performance of the district municipalities of Istanbul. Little’s MCAR test was applied to the 17 variables that contain missing values, and the missing values were found to be missing completely at random (MCAR). Cluster analysis was then applied to the 35 variables without missing values, and the missing values were imputed on the basis of the resulting clusters. The cluster labels of the municipalities obtained from the 35 complete variables were the same as those obtained from the full 52-variable data set after imputation. This stability of the cluster labels indicates that the imputed data set does not depart from the original data, that is, the data structure is not disrupted. Consequently, cluster analysis can be regarded as an effective way of imputing more representative values for missing data.
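
The workflow summarized in the abstract (cluster on the complete variables, impute within clusters, re-cluster, compare labels) can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm or which within-cluster statistic the authors used, so the plain k-means routine and the cluster-mean fill below are assumptions, and the six-row data set is invented toy data, not the municipality data. All function names are hypothetical. Variables are left unscaled here for brevity; real data would normally be standardized before clustering.

```python
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iters=50):
    """Plain k-means (Lloyd's algorithm); the first k rows seed the centroids."""
    centroids = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels


def impute_by_cluster(incomplete, labels):
    """Replace each None with the mean of observed values in the same cluster.

    Assumption: the cluster mean is the fill value; if a cluster has no
    observed value for a column, the overall column mean is used instead.
    """
    filled = [list(r) for r in incomplete]
    for j in range(len(incomplete[0])):
        observed = [(lab, r[j]) for lab, r in zip(labels, incomplete)
                    if r[j] is not None]
        overall = mean([v for _, v in observed])
        for i, lab in enumerate(labels):
            if filled[i][j] is None:
                same = [v for c, v in observed if c == lab]
                filled[i][j] = mean(same) if same else overall
    return filled


def same_partition(a, b):
    """True if two labelings group the rows identically (label names ignored)."""
    n = len(a)
    return all((a[i] == a[j]) == (b[i] == b[j])
               for i in range(n) for j in range(i + 1, n))


# Toy data (invented): two complete variables and one variable with gaps.
complete = [[1.0, 1.2], [5.0, 5.2], [0.9, 1.0],
            [5.1, 4.9], [1.1, 0.8], [4.8, 5.0]]
incomplete = [[10.0], [None], [None], [52.0], [12.0], [48.0]]

labels_before = kmeans(complete, k=2)                # cluster on complete variables
filled = impute_by_cluster(incomplete, labels_before)
full = [c + f for c, f in zip(complete, filled)]     # all variables after imputation
labels_after = kmeans(full, k=2)                     # re-cluster on the full data

print(filled)
print(same_partition(labels_before, labels_after))
```

If `same_partition` returns `True`, the imputation has not changed which rows are grouped together, which corresponds to the stability check described in the abstract. For larger data sets, a chance-corrected agreement measure between the two labelings would be a more informative comparison than exact equality.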

Keywords

Missing Value Analysis, Little’s MCAR Test, K-Nearest Neighbor Imputation Methods, Cluster Analysis

Details

Primary Language	English
Subjects	Operation
Journal Section	Articles
Authors	Uğur Arcagök (0000-0002-4469-9525); Çiğdem Arıcıgil Çilan (0000-0002-7862-7028)
Publication Date	December 31, 2021
Submission Date	July 12, 2021
Published in Issue	Year 2021, Volume 9, Issue 2

Cite

APA	Arcagök, U., & Arıcıgil Çilan, Ç. (2021). A Proposal Method for Missing Value Analysis: Cluster Analysis Approach. Alphanumeric Journal, 9(2), 299-310. https://doi.org/10.17093/alphanumeric.970448