Research Article

Year 2025, Volume: 12 Issue: 4, 197 - 207, 31.12.2025
https://doi.org/10.17350/HJSE19030000366


Investigating the Impact of One-Hot Encoding on Classification Performance and Time Complexity for Efficient Network Intrusion Detection


Abstract

As the number of network users grows, intrusion detection systems (IDS) have become a critical area of focus. Deploying machine learning (ML)-based systems is crucial because of their ability to learn from data. However, network data often contains both numerical and categorical features. This presents a significant challenge because some ML algorithms, such as Support Vector Machine (SVM) and k-Nearest Neighbour (kNN), require categorical features to be encoded before use. Here, we investigate the impact of One-Hot Encoding (OHE) on the classification performance and time complexity of ML algorithms, including Decision Trees (DTs), which accept categorical features directly, as well as SVM, kNN, and others. The study uses intrusion datasets that contain categorical features, such as NSL-KDD and UNSW-NB15. The performance of DTs and other classifiers was compared on encoded and unencoded datasets. Our findings are: (1) OHE can improve the classification performance of DT classifiers and does not negatively affect them; however, it increases time complexity because of the increased dimensionality; (2) DTs achieve performance comparable to that of the other classifiers with lower time complexity; and (3) OHE can help to transform complex categorical features so that irrelevant categories can be eliminated. The results of this experiment are presented to visualise the importance of the properties of DTs. This study shows that DTs are promising for developing time-efficient and accurate IDS.
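The dimensionality growth that the abstract attributes to OHE can be illustrated with a minimal sketch. This is not the authors' implementation (the reference list suggests the study was carried out in R); it is a standard-library Python illustration in which the helper `one_hot_encode`, the toy records, and the column names `duration`, `protocol_type`, and `service` are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def one_hot_encode(
    rows: Sequence[Dict[str, object]], categorical: Sequence[str]
) -> List[Dict[str, object]]:
    """Replace each categorical column with one 0/1 column per observed category."""
    # Collect the sorted set of category levels seen in each categorical column.
    levels = {col: sorted({str(row[col]) for row in rows}) for col in categorical}

    encoded = []
    for row in rows:
        # Copy the non-categorical (numerical) columns unchanged.
        out = {k: v for k, v in row.items() if k not in levels}
        # Add one indicator column per category level.
        for col, cats in levels.items():
            for cat in cats:
                out[f"{col}={cat}"] = 1 if str(row[col]) == cat else 0
        encoded.append(out)
    return encoded


# Hypothetical toy records (illustrative values, not taken from any dataset).
records = [
    {"duration": 0, "protocol_type": "tcp", "service": "http"},
    {"duration": 2, "protocol_type": "udp", "service": "dns"},
    {"duration": 1, "protocol_type": "icmp", "service": "echo"},
    {"duration": 5, "protocol_type": "tcp", "service": "ftp"},
]

encoded = one_hot_encode(records, ["protocol_type", "service"])

# One numerical column plus 3 protocol levels plus 4 service levels.
print(len(records[0]), "columns before,", len(encoded[0]), "columns after")
```

In this toy case three input columns become eight after encoding. A categorical feature with many distinct levels widens the table much further, which is the mechanism behind the added training and prediction time that the abstract reports for encoded data.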

There are 31 citations in total.

Details

Primary Language: English
Subjects: Artificial Intelligence (Other)
Journal Section: Research Article
Authors:

Hammed Ayomide Abdulsalam (ORCID: 0009-0001-4776-2032)

Hüseyin Güney (ORCID: 0000-0001-7924-1904)

Submission Date: August 16, 2024
Acceptance Date: November 16, 2025
Publication Date: December 31, 2025
Published in Issue: Year 2025, Volume: 12, Issue: 4

Cite

Vancouver: Abdulsalam HA, Güney H. Investigating the Impact of One-Hot Encoding on Classification Performance and Time Complexity for Efficient Network Intrusion Detection. Hittite J Sci Eng. 2025;12(4):197-207.

Hittite Journal of Science and Engineering is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY NC).