Research Article

Effect of different encoding techniques on the mushroom classification performance of KNN algorithm

Year 2025, Volume: 14 Issue: 1, 263 - 270, 15.01.2025
https://doi.org/10.28948/ngumuh.1515387

Abstract

This study investigates how different encoding techniques affect the K-Nearest Neighbors (KNN) algorithm when classifying mushrooms as poisonous or edible. Encoding techniques including label encoding, one-hot encoding, frequency encoding, hash encoding, and target encoding were used to convert the categorical features of a predominantly categorical dataset into numerical data. Model performance was evaluated using metrics including accuracy, precision, recall, and F1-score. The results show that frequency encoding performed best at k=1, while target encoding performed worst at k=7. The findings offer insight into how categorical data transformation affects the KNN model and how more accurate classification results can be achieved.
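
As a rough illustration of the pipeline the abstract describes, the sketch below encodes a tiny categorical table in three of the five ways named (label, frequency, and one-hot encoding), classifies with a plain Euclidean KNN, and computes precision, recall, and F1. It is not the author's implementation, which this page does not show; the toy rows, the column meanings, and every function name are assumptions made up for the example, and hash and target encoding are omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical toy rows (cap colour, odour); NOT taken from the paper's dataset.
TRAIN_X = [("brown", "none"), ("brown", "foul"), ("brown", "almond"),
           ("white", "none"), ("white", "foul"), ("red", "foul")]
TRAIN_Y = ["edible", "poisonous", "edible", "edible", "poisonous", "poisonous"]


def fit_label(values):
    """Label encoding: each category becomes an integer (sorted order)."""
    return {cat: float(i) for i, cat in enumerate(sorted(set(values)))}


def fit_frequency(values):
    """Frequency encoding: each category becomes its share of the column."""
    return {cat: n / len(values) for cat, n in Counter(values).items()}


def fit_one_hot(values):
    """One-hot encoding: each category becomes a 0/1 indicator vector."""
    cats = sorted(set(values))
    return {cat: [float(c == cat) for c in cats] for cat in cats}


def fit_columns(rows, fit_fn):
    """Fit one encoder per column, using the training rows only."""
    return [fit_fn([row[j] for row in rows]) for j in range(len(rows[0]))]


def encode(rows, encoders):
    """Turn categorical rows into flat numeric vectors.

    Categories unseen during fitting raise KeyError in this sketch.
    """
    vectors = []
    for row in rows:
        vec = []
        for value, enc in zip(row, encoders):
            code = enc[value]
            vec.extend(code if isinstance(code, list) else [code])
        vectors.append(vec)
    return vectors


def knn_predict(train_vecs, train_labels, query, k):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    for name, fit_fn in [("label", fit_label),
                         ("frequency", fit_frequency),
                         ("one-hot", fit_one_hot)]:
        encoders = fit_columns(TRAIN_X, fit_fn)
        train_vecs = encode(TRAIN_X, encoders)
        query = encode([("white", "almond")], encoders)[0]
        print(name, knn_predict(train_vecs, TRAIN_Y, query, k=1))
```

Because KNN works purely on distances, the three encoders place the same rows at different positions in feature space, which is why the choice of encoding can change which neighbours are nearest. Whatever this toy example prints says nothing about the results reported in the paper.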

Details

Primary Language: Turkish
Subjects: Reinforcement Learning, Satisfiability and Optimisation, Artificial Intelligence (Other)
Section: Research Articles
Authors

Kadir İleri 0000-0002-5041-6165

Early Publication Date: 25 December 2024
Publication Date: 15 January 2025
Submission Date: 12 July 2024
Acceptance Date: 16 December 2024
Published in Issue: Year 2025, Volume: 14 Issue: 1

Cite

APA: İleri, K. (2025). Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, 14(1), 263-270. https://doi.org/10.28948/ngumuh.1515387
AMA: İleri K. Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi. NÖHÜ Müh. Bilim. Derg. January 2025;14(1):263-270. doi:10.28948/ngumuh.1515387
Chicago: İleri, Kadir. “Farklı Kodlama Tekniklerinin KNN Algoritmasının Mantar Sınıflandırma Performansı Üzerindeki Etkisi”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 14, no. 1 (January 2025): 263-70. https://doi.org/10.28948/ngumuh.1515387.
EndNote: İleri K (01 January 2025) Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 14 1 263–270.
IEEE: K. İleri, “Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi”, NÖHÜ Müh. Bilim. Derg., vol. 14, no. 1, pp. 263–270, 2025, doi: 10.28948/ngumuh.1515387.
ISNAD: İleri, Kadir. “Farklı Kodlama Tekniklerinin KNN Algoritmasının Mantar Sınıflandırma Performansı Üzerindeki Etkisi”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 14/1 (January 2025), 263-270. https://doi.org/10.28948/ngumuh.1515387.
JAMA: İleri K. Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi. NÖHÜ Müh. Bilim. Derg. 2025;14:263–270.
MLA: İleri, Kadir. “Farklı Kodlama Tekniklerinin KNN Algoritmasının Mantar Sınıflandırma Performansı Üzerindeki Etkisi”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, vol. 14, no. 1, 2025, pp. 263-70, doi:10.28948/ngumuh.1515387.
Vancouver: İleri K. Farklı kodlama tekniklerinin KNN algoritmasının mantar sınıflandırma performansı üzerindeki etkisi. NÖHÜ Müh. Bilim. Derg. 2025;14(1):263-70.
