A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications

Mehmet Alper Şahin; Uğur Üresin

doi:10.2339/politeknik.1201559

Research Article

Year 2024, EARLY VIEW, 1 - 1

Abstract

Data analysis depends largely on understanding and preprocessing data that come from various sources and serve various applications. Missing values can critically affect how well a dataset reflects its underlying characteristics; imputing them is therefore valuable, both to reduce deviation in analytic results and to avoid losing data. There are several approaches to filling in missing values, one of which is correlation-based imputation. This approach relies on high correlation between parameters: the correlated parameters serve as the variables of a linear equation, and that equation is used to predict the missing values. In this study, improvements were made to the correlation-based imputation method. The proposed method was applied to three datasets from the automotive industry. Missing values were generated manually by removing randomly selected values from the real data. The removed values were then predicted with the proposed method, and the margin of error between each estimated value and the actual value was calculated. The results were compared with those of arithmetic mean imputation, median imputation, k-nearest neighbor imputation, and multivariate imputation by chained equations; the proposed method gave considerably better results on all three datasets.

Keywords

missing data imputation, data preprocessing, missing value, data imputation, industrial data processing

Supporting Institution

Ford Otosan

Imputing Missing Data for Industrial Analytics Applications (M-CBRI)

Abstract

The first stages of a data analytics study are collecting, analyzing, and cleaning the data. Because the collected data come from different sources, and because those sources can suffer interruptions, missing values may appear in the dataset. In addition, removing outliers from the dataset during data cleaning also produces missing values. Missing values in the data can cause deviations in the outputs that analytic applications aim to produce. Handling missing data is therefore an important process, both to reduce this deviation and to avoid losing collected data. The literature offers many methods for imputing missing data, but choosing the appropriate one requires experience and expertise. In this study, improvements were made to the linear-correlation-based imputation algorithm in order to predict missing data. The algorithm was tested on three different real datasets obtained from different processes of an automotive manufacturer. Values randomly deleted from the datasets were predicted with the developed methods, and the margin of error between the predicted value and the actual value was calculated. The results of the developed algorithm were compared with mean imputation, median imputation, nearest-neighbor imputation, and multivariate imputation by chained equations. For all three datasets, the developed method was observed to predict more successfully than the other methods.
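The evaluation protocol described above (delete values at random, impute them, measure the error against the true values) can be sketched as follows. This is an illustrative outline only: it covers the mean and median baselines, uses mean absolute error as an assumed error measure, and the function names and sample values are hypothetical. The nearest-neighbor and chained-equations baselines are omitted for brevity.

```python
import random
from statistics import mean, median

def mask_at_random(values, fraction, seed=0):
    """Copy values, set a random fraction to None, return (masked, indices)."""
    rng = random.Random(seed)
    k = max(1, round(len(values) * fraction))
    idx = sorted(rng.sample(range(len(values)), k))
    masked = list(values)
    for i in idx:
        masked[i] = None
    return masked, idx

def impute_constant(values, statistic):
    """Replace every None with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

def mean_absolute_error(actual, imputed, idx):
    """Average absolute gap between true and imputed values at idx."""
    return sum(abs(actual[i] - imputed[i]) for i in idx) / len(idx)

# Hypothetical measurements with one unusually large reading.
actual = [12.0, 15.0, 11.0, 14.0, 13.0, 40.0, 12.5, 13.5, 14.5, 11.5]
masked, idx = mask_at_random(actual, fraction=0.3, seed=42)

for name, statistic in (("mean", mean), ("median", median)):
    imputed = impute_constant(masked, statistic)
    error = mean_absolute_error(actual, imputed, idx)
    print(f"{name}: MAE = {error:.3f}")
```

Fixing the seed keeps the set of deleted positions identical across methods, so every imputation method is scored on the same missing entries.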

Keywords

missing data imputation, data preprocessing, missing data, imputation, industrial data processing

Details

Primary Language	English
Subjects	Engineering
Journal Section	Research Article
Authors	Mehmet Alper Şahin (ORCID 0000-0003-1196-8765); Uğur Üresin (ORCID 0000-0002-9100-9697)
Early Pub Date	March 15, 2024
Publication Date
Submission Date	November 16, 2022
Published in Issue	Year 2024 EARLY VIEW

Cite

APA	Şahin, M. A., & Üresin, U. (2024). A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications. Politeknik Dergisi, 1-1. https://doi.org/10.2339/politeknik.1201559
AMA	Şahin MA, Üresin U. A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications. Politeknik Dergisi. Published online March 1, 2024:1-1. doi:10.2339/politeknik.1201559
Chicago	Şahin, Mehmet Alper, and Uğur Üresin. “A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications”. Politeknik Dergisi (March 2024), 1-1. https://doi.org/10.2339/politeknik.1201559.
EndNote	Şahin MA, Üresin U (March 1, 2024) A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications. Politeknik Dergisi 1–1.
IEEE	M. A. Şahin and U. Üresin, “A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications”, Politeknik Dergisi, pp. 1–1, March 2024, doi: 10.2339/politeknik.1201559.
ISNAD	Şahin, Mehmet Alper - Üresin, Uğur. “A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications”. Politeknik Dergisi. March 2024. 1-1. https://doi.org/10.2339/politeknik.1201559.
JAMA	Şahin MA, Üresin U. A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications. Politeknik Dergisi. 2024;:1–1.
MLA	Şahin, Mehmet Alper and Uğur Üresin. “A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications”. Politeknik Dergisi, 2024, pp. 1-1, doi:10.2339/politeknik.1201559.
Vancouver	Şahin MA, Üresin U. A Novel Data Imputation Method (M-CBRI) for Industrial Analytic Applications. Politeknik Dergisi. 2024:1-.