Research Article

Learning From High-Cardinality Categorical Features in Deep Neural Networks

Volume: 8 Number: 2 June 23, 2022

Abstract

Some machine learning algorithms expect both the input and output variables to be numeric. Therefore, at an early stage of modelling, feature engineering is required when categorical variables are present in the dataset, and those attributes must be encoded into an appropriate feature vector. However, categorical variables with more than 100 unique values are considered high-cardinality, and there is no straightforward method to handle them. Moreover, most of the work on categorical variable encoding in the literature assumes that the set of categories is limited, known beforehand, and made up of mutually exclusive elements, independently of the data, which is not necessarily true for real-world applications. Feature engineering typically tackles high-cardinality issues with data-cleaning techniques, which are time-consuming and often require human intervention and domain expertise, both major costs in data science projects. The most common methods of transforming categorical variables are one-hot encoding and target encoding. To address the issue of encoding categorical variables in high-cardinality settings, we seek a general-purpose approach to the statistical analysis of categorical entries that can handle a very large number of categories while avoiding computational and statistical difficulties. Our proposed approach is low-dimensional and therefore very efficient in processing time and memory, and it can be computed in an online learning setting. Although in this paper we opt to use it in the input layer, the learned dictionaries are architecture-independent and can be moved between different architectures or layers.
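The two common encodings named in the abstract can be sketched as follows. This is a minimal illustration on toy data, not the paper's proposed method; the variable names and the example dataset are hypothetical. It also shows why one-hot encoding scales poorly: the number of columns equals the number of unique categories.

```python
import numpy as np

# Toy dataset: one categorical feature and a binary target.
cities = ["paris", "london", "paris", "tokyo", "london", "paris"]
y = np.array([1, 0, 1, 1, 0, 0])

# --- One-hot encoding: one binary column per unique category ---
# Dimensionality grows with cardinality: 3 categories -> 3 columns,
# so 100+ categories would already produce 100+ sparse columns.
categories = sorted(set(cities))
one_hot = np.array(
    [[1 if c == cat else 0 for cat in categories] for c in cities]
)

# --- Target encoding: replace each category with the mean target
# --- observed for that category (a single numeric column).
means = {
    cat: y[[i for i, c in enumerate(cities) if c == cat]].mean()
    for cat in categories
}
target_encoded = np.array([means[c] for c in cities])
```

Target encoding keeps the representation one-dimensional regardless of cardinality, but in practice it needs regularization (e.g., smoothing or cross-fold estimation) to avoid leaking the target into the features.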

Keywords


Details

Primary Language

English

Subjects

Artificial Intelligence, Engineering

Journal Section

Research Article

Publication Date

June 23, 2022

Submission Date

October 25, 2021

Acceptance Date

January 18, 2022

Published in Issue

Year 2022 Volume: 8 Number: 2

APA
Arat, M. M. (2022). Learning From High-Cardinality Categorical Features in Deep Neural Networks. Journal of Advanced Research in Natural and Applied Sciences, 8(2), 222-236. https://doi.org/10.28979/jarnas.1014469
AMA
1.Arat MM. Learning From High-Cardinality Categorical Features in Deep Neural Networks. JARNAS. 2022;8(2):222-236. doi:10.28979/jarnas.1014469
Chicago
Arat, Mustafa Murat. 2022. “Learning From High-Cardinality Categorical Features in Deep Neural Networks”. Journal of Advanced Research in Natural and Applied Sciences 8 (2): 222-36. https://doi.org/10.28979/jarnas.1014469.
EndNote
Arat MM (June 1, 2022) Learning From High-Cardinality Categorical Features in Deep Neural Networks. Journal of Advanced Research in Natural and Applied Sciences 8 2 222–236.
IEEE
[1]M. M. Arat, “Learning From High-Cardinality Categorical Features in Deep Neural Networks”, JARNAS, vol. 8, no. 2, pp. 222–236, June 2022, doi: 10.28979/jarnas.1014469.
ISNAD
Arat, Mustafa Murat. “Learning From High-Cardinality Categorical Features in Deep Neural Networks”. Journal of Advanced Research in Natural and Applied Sciences 8/2 (June 1, 2022): 222-236. https://doi.org/10.28979/jarnas.1014469.
JAMA
1.Arat MM. Learning From High-Cardinality Categorical Features in Deep Neural Networks. JARNAS. 2022;8:222–236.
MLA
Arat, Mustafa Murat. “Learning From High-Cardinality Categorical Features in Deep Neural Networks”. Journal of Advanced Research in Natural and Applied Sciences, vol. 8, no. 2, June 2022, pp. 222-36, doi:10.28979/jarnas.1014469.
Vancouver
1.Mustafa Murat Arat. Learning From High-Cardinality Categorical Features in Deep Neural Networks. JARNAS. 2022 Jun. 1;8(2):222-36. doi:10.28979/jarnas.1014469

Indexed In

TR Dizin, SAO/NASA Astrophysics Data System (ADS), American Chemical Society - Chemical Abstracts Service (CAS), DOAJ, EBSCO, Scilit, SOBİAD

JARNAS is licensed under a Creative Commons Attribution-NonCommercial 4.0 International Licence (CC BY-NC).