TY  - JOUR
T1  - Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi
TT  - Effect of different encoding techniques on the mushroom classification performance of KNN algorithm
AU  - İleri, Kadir
PY  - 2025
DA  - January
Y2  - 2024
DO  - 10.28948/ngumuh.1515387
JF  - Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi
JO  - NÖHÜ Müh. Bilim. Derg.
PB  - Nigde Omer Halisdemir University
WT  - DergiPark
SN  - 2564-6605
SP  - 263
EP  - 270
VL  - 14
IS  - 1
LA  - tr
AB  - This study investigates how different encoding techniques affect the K-Nearest Neighbors (KNN) algorithm when classifying mushrooms as poisonous or edible. Encoding techniques such as label encoding, one-hot encoding, frequency encoding, hash encoding, and target encoding were used to convert the categorical features of a predominantly categorical dataset into numerical data. The performance of the model was evaluated using metrics such as accuracy, precision, recall, and F1-score. The results revealed that frequency encoding showed the best performance at k=1, while target encoding showed the lowest performance at k=7. The findings provide significant insights into the impact of categorical data transformation on the KNN model and into achieving more accurate classification results.
KW  - KNN classifier
KW  - Categorical data
KW  - Label encoding
KW  - One-hot encoding
KW  - Frequency encoding
CR  - C. Pan, A. Poddar, R. Mukherjee, and A.K. Ray, Impact of categorical and numerical features in ensemble machine learning frameworks for heart disease prediction. Biomedical Signal Processing and Control, 76, 103666, 2022. https://doi.org/10.1016/j.bspc.2022.103666.
CR  - K.S. Sree, J. Karthik, C. Niharika, P.V.V.S. Srinivas, N. Ravinder, and C. Prasad, Optimized conversion of categorical and numerical features in machine learning models. In 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pp. 294-299, IEEE, November 2021. https://doi.org/10.1109/I-SMAC52330.2021.9640967.
CR  - G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, KNN model-based approach in classification. In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, pp. 986-996, November 3-7, 2003.
CR  - H. Gupta and V. Asha, Impact of encoding of high cardinality categorical data to solve prediction problems. Journal of Computational and Theoretical Nanoscience, vol. 17, no. 9-10, pp. 4197-4201, 2020. https://doi.org/10.1166/jctn.2020.9044.
CR  - P. Yan, Anomaly Detection in Categorical Data with Interpretable Machine Learning: A random forest approach to classify imbalanced data. 2019.
CR  - K. Budholiya, S.K. Shrivastava, and V. Sharma, An optimized XGBoost based diagnostic system for effective prediction of heart disease. Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 7, pp. 4514-4523, 2022. https://doi.org/10.1016/j.jksuci.2020.10.013.
CR  - T. Al-Shehari and R.A. Alsowail, An insider data leakage detection using one-hot encoding, synthetic minority oversampling, and machine learning techniques. Entropy, vol. 23, no. 10, p. 1258, 2021. https://doi.org/10.3390/e23101258.
CR  - M. Hosni, Encoding Techniques for Handling Categorical Data in Machine Learning-Based Software Development Effort Estimation. In KDIR, pp. 460-467, 2023.
CR  - M.X. Low, T.T.V. Yap, W.K. Soo, H. Ng, V.T. Goh, J.J. Chin, and T.Y. Kuek, Comparison of label encoding and evidence counting for malware classification. Journal of System and Management Sciences, vol. 12, no. 6, pp. 17-30, 2022. https://doi.org/10.33168/JSMS.2022.0602.
CR  - S.K. Das and M.Z. Rahman, A Study on Machine Learning Algorithms with Different Encoding Techniques for Identifying the Right One for Patients' Big Data. Jahangirnagar University Journal of Science, vol. 43, no. 1, pp. 63-78, 2021.
CR  - F. Pargent, F. Pfisterer, J. Thomas, and B. Bischl, Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. Computational Statistics, vol. 37, no. 5, pp. 2671-2692, 2022. https://doi.org/10.1007/s00180-022-01207-6.
CR  - A.S. Mohanty, K.C. Patra, and P. Parida, Toddler ASD Classification Using Machine Learning Techniques. International Journal of Online & Biomedical Engineering, vol. 17, no. 7, 2021. https://doi.org/10.3991/ijoe.v17i07.23497.
CR  - S. Zhang, Y. Yuan, Z. Yao, X. Wang, and Z. Lei, Improvement of the performance of models for predicting coronary artery disease based on XGBoost algorithm and feature processing technology. Electronics, vol. 11, no. 3, p. 315, 2022. https://doi.org/10.3390/electronics11030315.
CR  - L.B. Nascimento, M. de Sousa Balbino, M.L. Teodoro, and C.N. Nobre, Assessment of the Relationship Between Attribute Coding and the Interpretability of Machine Learning Models: An Analysis in the Context of Children and Adolescents with Depression. In BIOSTEC (2), pp. 482-489, 2024.
CR  - F. Pargent, B. Bischl, and J. Thomas, A benchmark experiment on how to encode categorical features in predictive modeling. München: Ludwig-Maximilians-Universität München, 2019.
CR  - D. Wagner, D. Heider, and G. Hattab, Mushroom data creation, curation, and simulation to support classification tasks. Scientific Reports, vol. 11, no. 1, p. 8134, 2021. https://doi.org/10.1038/s41598-021-87602-3.
CR  - UCI Machine Learning Repository, Secondary Mushroom. https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset, Accessed 25 June 2024.
CR  - UCI Machine Learning Repository, Mushroom. https://archive.ics.uci.edu/dataset/73/mushroom, Accessed 25 June 2024.
CR  - M.K. Dahouda and I. Joe, A deep-learned embedding technique for categorical features encoding. IEEE Access, vol. 9, pp. 114381-114391, 2021. https://doi.org/10.1109/ACCESS.2021.3104357.
CR  - C. Seger, An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing. 2018.
CR  - C.T.T. Thuy, K.A. Tran, and C.N. Giap, Optimize the combination of categorical variable encoding and deep learning technique for the problem of prediction of Vietnamese student academic performance. International Journal of Advanced Computer Science and Applications, vol. 11, no. 11, 2020. https://doi.org/10.14569/IJACSA.2020.0111135.
CR  - I. Lopez-Arevalo, E. Aldana-Bobadilla, A. Molina-Villegas, H. Galeana-Zapién, V. Muñiz-Sanchez, and S. Gausin-Valle, A memory-efficient encoding method for processing mixed-type data on machine learning. Entropy, vol. 22, no. 12, p. 1391, 2020. https://doi.org/10.3390/e22121391.
CR  - S. Mumtaz and M. Giese, Hierarchy-based semantic embeddings for single-valued & multi-valued categorical variables. Journal of Intelligent Information Systems, vol. 58, no. 3, pp. 613-640, 2022. https://doi.org/10.1007/s10844-021-00693-2.
CR  - A. Almomany, W.R. Ayyad, and A. Jarrah, Optimized implementation of an improved KNN classification algorithm using Intel FPGA platform: Covid-19 case study. Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 6, pp. 3815-3827, 2022. https://doi.org/10.1016/j.jksuci.2022.04.006.
CR  - J. Lever, Classification evaluation: It is important to understand both what a classification metric expresses and what it hides. Nature Methods, vol. 13, no. 8, pp. 603-605, 2016.
CR  - Ž. Vujović, Classification model evaluation metrics. International Journal of Advanced Computer Science and Applications, vol. 12, no. 6, pp. 599-606, 2021.
CR  - H. Jabbar and R.Z. Khan, Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study). Computer Science, Communication and Instrumentation Devices, vol. 70, no. 10.3850, pp. 978-981, 2015.
UR  - https://doi.org/10.28948/ngumuh.1515387
L1  - https://dergipark.org.tr/en/download/article-file/4067094
ER  - 