Research Article

Investigation of The Effects of Data Scaling and Imputation of Missing Data Approaches on The Success of Machine Learning Methods

Year 2023, Volume: 11 Issue: 1, 78 - 88, 31.01.2023
https://doi.org/10.29130/dubited.948564

Abstract

Advances in technology and informatics have increased the size and diversity of collected data and made it easier to record and share. Computers, and machine learning algorithms in particular, play a major role in analyzing data that would be very difficult to analyze by hand. Data preprocessing is a key stage of this analysis: missing values are imputed and the data are scaled. The literature contains studies that examine the effects of missing-data imputation and of data scaling on algorithms separately, but these two stages also need to be evaluated together. This study investigated the effect of missing-data imputation and data scaling approaches on the classification success of Artificial Neural Networks, Support Vector Machines, and Random Forest algorithms on a Hepatocellular Carcinoma (HCC) data set. The best classification was achieved by combining mean imputation of missing data with min-max scaling. In addition, the random forest algorithm was found to be more successful at classification than the other algorithms.
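
The combination reported as best in the abstract (mean imputation followed by min-max scaling, then a classifier) can be illustrated with a minimal sketch. It assumes scikit-learn and uses synthetic data as a stand-in for the HCC data set; the helper `evaluate`, the missing-value rate, the model settings, and the 5-fold cross-validation are illustrative assumptions, not the study's actual setup, so the printed accuracies say nothing about the published results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: the real HCC data set is not used here).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Blank out roughly 10% of the entries at random to mimic missing values.
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.10] = np.nan

# The three classifier families named in the abstract (hyperparameters are assumptions).
models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}


def evaluate(models, X, y, cv=5):
    """Return mean cross-validated accuracy per model (hypothetical helper)."""
    scores = {}
    for name, clf in models.items():
        # Imputer and scaler sit inside the pipeline, so they are fitted
        # on the training folds only and no test-fold statistics leak in.
        pipe = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler(), clf)
        scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
    return scores


scores = evaluate(models, X, y)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Swapping `SimpleImputer(strategy="mean")` or `MinMaxScaler()` for another imputer or scaler in the same pipeline is one way to compare preprocessing combinations jointly rather than separately, which is the kind of comparison the abstract argues for.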

Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi

Details

Primary Language: Turkish
Subjects: Engineering
Journal Section: Articles
Authors: Mesut Polatgil (ORCID 0000-0002-7503-2977)
Publication Date: January 31, 2023
Published in Issue: Year 2023, Volume 11, Issue 1

Cite

APA Polatgil, M. (2023). Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi. Duzce University Journal of Science and Technology, 11(1), 78-88. https://doi.org/10.29130/dubited.948564
AMA Polatgil M. Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi. DUBİTED. January 2023;11(1):78-88. doi:10.29130/dubited.948564
Chicago Polatgil, Mesut. “Veri Ölçekleme Ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi”. Duzce University Journal of Science and Technology 11, no. 1 (January 2023): 78-88. https://doi.org/10.29130/dubited.948564.
EndNote Polatgil M (January 1, 2023) Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi. Duzce University Journal of Science and Technology 11 1 78–88.
IEEE M. Polatgil, “Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi”, DUBİTED, vol. 11, no. 1, pp. 78–88, 2023, doi: 10.29130/dubited.948564.
ISNAD Polatgil, Mesut. “Veri Ölçekleme Ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi”. Duzce University Journal of Science and Technology 11/1 (January 2023), 78-88. https://doi.org/10.29130/dubited.948564.
JAMA Polatgil M. Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi. DUBİTED. 2023;11:78–88.
MLA Polatgil, Mesut. “Veri Ölçekleme Ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi”. Duzce University Journal of Science and Technology, vol. 11, no. 1, 2023, pp. 78-88, doi:10.29130/dubited.948564.
Vancouver Polatgil M. Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi. DUBİTED. 2023;11(1):78-88.