Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi

Mesut Polatgil

doi:10.29130/dubited.948564

Investigation of The Effects of Data Scaling and Imputation of Missing Data Approaches on The Success of Machine Learning Methods

Abstract

Advances in technology and informatics have increased the size and diversity of collected data and made it easier to record and share. Computers, and machine learning algorithms in particular, play a major role in analyzing data that would be very difficult to analyze by hand. Data preprocessing is a key stage of this analysis; it includes imputing missing data and scaling the data. The literature contains studies that examine the effects of missing-data imputation and of data scaling on algorithms separately, but these two stages also need to be evaluated together. This study investigates, on a Hepatocellular Carcinoma (HCC) data set, how missing-data imputation and data scaling approaches affect the classification success of Artificial Neural Networks, Support Vector Machines, and Random Forest algorithms. The best classification was obtained when missing data were imputed with the mean and the data were scaled with min-max scaling. The Random Forest algorithm also classified more successfully than the other algorithms.

Keywords

Missing data, Hepatocellular Carcinoma, Data Scaling, Machine learning
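
The two preprocessing steps that the abstract reports as the best combination, mean imputation followed by min-max scaling, can be illustrated with a minimal sketch. This is not the author's code: the function names (`impute_mean`, `min_max_scale`, `preprocess`) and the toy matrix are hypothetical, chosen only for illustration, and the sketch uses plain Python so that each step is visible.

```python
from math import isnan


def impute_mean(column):
    """Replace missing entries (None or NaN) with the mean of the observed ones."""
    observed = [v for v in column if v is not None and not isnan(v)]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None or isnan(v) else v for v in column]


def min_max_scale(column):
    """Rescale a column linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def preprocess(rows):
    """Apply mean imputation, then min-max scaling, column by column."""
    columns = [list(col) for col in zip(*rows)]
    processed = [min_max_scale(impute_mean(col)) for col in columns]
    return [list(row) for row in zip(*processed)]


# Hypothetical toy feature matrix (not HCC data); None marks a missing value.
toy = [
    [2.0, 10.0],
    [None, 30.0],
    [6.0, None],
    [4.0, 50.0],
]
scaled = preprocess(toy)
print(scaled)
```

In a train/test setting, the column means and the minimum and maximum would normally be computed on the training split only and then reused on the test split, so that no information leaks from the test data. Libraries such as scikit-learn provide equivalent steps (`SimpleImputer` with `strategy="mean"` and `MinMaxScaler`); whether the study used them in this form is an assumption, not something the abstract states.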

Details

Primary Language

Turkish

Subjects

Engineering

Journal Section

Research Article

Authors

Mesut Polatgil*
0000-0002-7503-2977
Türkiye

Publication Date

January 31, 2023

Submission Date

June 6, 2021

Acceptance Date

March 15, 2022

Published in Issue

Year: 2023, Volume: 11, Number: 1

DOI

https://doi.org/10.29130/dubited.948564

IZ

https://izlik.org/JA89YZ92XR

APA

Polatgil, M. (2023). Veri Ölçekleme ve Eksik Veri Tamamlama Yöntemlerinin Makine Öğrenmesi Yöntemlerinin Başarısına Etkisinin İncelenmesi. Duzce University Journal of Science and Technology, 11(1), 78-88. https://doi.org/10.29130/dubited.948564

Cited By

Spatial modeling of chlorophyll-a parameter by Landsat-8 satellite data and deep learning techniques: The case of Lake Mogan

Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi

https://doi.org/10.28948/ngumuh.1603421

Analysis of the Behavior of The Input Data Set Attributes Affecting the Outputs in MLP Based Artificial Intelligence Models According to the Model

Journal of Information Systems and Management Research

https://doi.org/10.59940/jismar.1577691

Hybrid approach for missing data imputation via correlation-based interpolation and outlier analysis

Neural Computing and Applications

https://doi.org/10.1007/s00521-026-11961-z