Research Article

Estimation of Missing Data in OECD Industrial Production Data by the kNN Method

Year 2021, Volume: 9 Issue: 4, 955 - 967, 31.08.2021
https://doi.org/10.18506/anemon.888642

Abstract

The Organisation for Economic Co-operation and Development (OECD) is an international organization that works to create better policies for better lives. To this end, the OECD collects data on countries across many indicators. Accurate analysis requires these data to be complete, but the information collected from different national and international sources contains gaps. These gaps are especially problematic for researchers who work with statistical analysis and machine learning methods, because such analyses require the missing values in a data set to be dealt with first. In general, incomplete data has a negative effect on statistical analysis. Both traditional and modern methods exist to address this problem. Data can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR); for this reason, each variable must be handled separately. The industrial production data set in the Main Economic Indicators database contains 4046 values for 34 countries, of which 113 are missing and 3933 are complete. To divide the data set into different groups, the study used a machine learning algorithm called k-nearest neighbor (kNN), which is widely used because it is simple to apply. The number of nearest neighbors used in the study was set to k=15. The missing data were estimated with a success rate of 86.8%.
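As an illustration of the general idea only, the following Python sketch fills each missing cell with the mean of its k nearest rows. It is not the author's implementation: the function name `knn_impute`, the root-mean-square distance over shared columns, the use of a neighbor mean, and the toy values are all assumptions made for this example. Only k=15 and the counts 4046, 113, and 3933 come from the abstract.

```python
import math


def knn_impute(rows, k=15):
    """Fill None cells with the mean of the k nearest rows that observe that column.

    Distance between two rows is the root-mean-square difference over the
    columns both rows observe, so rows with different gaps stay comparable.
    (Hypothetical helper; the distance and averaging rule are assumptions.)
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with an observed value in column j,
            # ordered from nearest to farthest.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy data (invented): each row is one country, each column one period.
data = [
    [100.0, 102.0, 104.0],
    [101.0, 103.0, None],   # the cell to estimate
    [99.0, 101.0, 103.0],
    [50.0, 52.0, 54.0],     # a distant row that k=2 leaves out
]
result = knn_impute(data, k=2)
print(result[1][2])  # mean of the two nearest rows' values: 103.5

# Share of missing values reported in the abstract: 113 of 4046.
missing, complete, total = 113, 3933, 4046
missing_share = missing / total
print(f"{missing_share:.2%}")  # 2.79%
```

With the full data set, the same call would use the default k=15. A small k is used here only because the toy table has four rows.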


OECD Endüstriyel Üretim Verilerinde Bulunan Kayıp Verilerin kNN Yöntemi İle Tahmini


Abstract

The Organisation for Economic Co-operation and Development (OECD) is an international organization that works to create better lives. To this end, the OECD collects data on countries across many indicators. More accurate analysis requires these data to be complete. However, information gathered from different national and international sources contains gaps. These gaps cause problems especially for researchers who want to work with statistical analysis and machine learning methods. For such analyses, data sets must first be cleared of missing data. In general, missing data has a negative effect on statistical analyses. There are traditional and modern methods for solving this problem. Variables can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). For this reason, each variable must be handled separately. The data set titled industrial production in the Main Economic Indicators database contains 4046 values for 34 countries: 113 missing and 3933 complete. To divide the data set into different groups, the study used a machine learning algorithm called k-nearest neighbor (kNN). The kNN algorithm is widely used because it is simple to use. The nearest-neighbor value of the algorithm used in the study was set to k=15. A success rate of 86.8% was achieved in estimating the missing data.


Details

Primary Language Turkish
Journal Section Research Article
Authors

Serkan Metin 0000-0003-1765-7474

Publication Date August 31, 2021
Acceptance Date April 14, 2021
Published in Issue Year 2021 Volume: 9 Issue: 4

Cite

APA Metin, S. (2021). OECD Endüstriyel Üretim Verilerinde Bulunan Kayıp Verilerin kNN Yöntemi İle Tahmini. Anemon Muş Alparslan Üniversitesi Sosyal Bilimler Dergisi, 9(4), 955-967. https://doi.org/10.18506/anemon.888642

Anemon Muş Alparslan Üniversitesi Sosyal Bilimler Dergisi is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY NC).