Research Article
Year 2022, Volume 8, Issue 2, 222 - 236, 23.06.2022
https://doi.org/10.28979/jarnas.1014469


Learning From High-Cardinality Categorical Features in Deep Neural Networks


Abstract

Some machine learning algorithms expect both the input and the output variables to be numeric. Therefore, when categorical variables are present in the dataset, feature engineering is required at an early stage of modelling to encode those attributes into an appropriate feature vector. However, categorical variables with more than 100 unique values are considered high-cardinality, and no straightforward method exists for handling them. Moreover, most of the work on categorical variable encoding in the literature assumes that the set of categories is limited, known beforehand, and made up of mutually exclusive elements, independently of the data, which is not necessarily true in real-world applications. Feature engineering typically tackles high-cardinality issues with data-cleaning techniques, which are time-consuming and often need human intervention and domain expertise, both of which are major costs in data science projects. The most common methods of transforming categorical variables are one-hot encoding and target encoding. To address the issue of encoding categorical variables in high-cardinality settings, we seek a general-purpose approach for the statistical analysis of categorical entries that can handle a very large number of categories while avoiding computational and statistical difficulties. Our proposed approach is low-dimensional; thus, it is very efficient in processing time and memory, and it can be computed in an online learning setting. Although in this paper we opt to use it in the input layer, dictionaries are typically architecture-independent and may be moved between different architectures or layers.
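As a rough illustration of the two common encodings named in the abstract, the following minimal sketch contrasts one-hot encoding with target encoding on a toy column. It is not the approach proposed in the paper; the function names `one_hot_encode` and `target_encode`, the `cities` data, and the unsmoothed per-category mean are all assumptions made for the example.

```python
from collections import defaultdict


def one_hot_encode(values):
    """Map each value to a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        vectors.append(row)
    return categories, vectors


def target_encode(values, targets):
    """Replace each category by the mean target seen for it (no smoothing)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[v] for v in values]


# Hypothetical toy data: a categorical "city" column and a binary target.
cities = ["ankara", "izmir", "ankara", "bursa", "izmir", "ankara"]
labels = [1, 0, 1, 0, 1, 0]

categories, one_hot = one_hot_encode(cities)
encoded = target_encode(cities, labels)

# One-hot width equals the number of distinct categories,
# while target encoding yields a single number per row.
print(len(one_hot[0]))
print(encoded)
```

With only three distinct cities the one-hot vectors stay small, but their width grows linearly with the number of categories, which is the cost that motivates lower-dimensional encodings for high-cardinality features.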

References

  • Au, T. C. (2018). Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem. Journal of Machine Learning Research, 19, 1-30. Retrieved From: https://www.jmlr.org/papers/v19
  • Bengio, Y. (2012). Practical Recommendations for Gradient-Based Training of Deep Architectures. In G. Montavon, G. B. Orr, & K. R. Müller (Eds.), Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science (Vol. 7700). Berlin, Heidelberg: Springer. DOI: https://doi.org/10.1007/978-3-642-35289-8_26
  • Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., & Gauvain, J.-L. (2006). Neural Probabilistic Language Models. In D. E. Holmes & L. C. Jain (Eds.), Innovations in Machine Learning (Vol. 194, pp. 137-186). Berlin, Heidelberg: Springer. DOI: https://doi.org/10.1007/3-540-33486-6_6
  • Cerda, P., Varoquaux, G., & Kégl, B. (2018). Similarity encoding for learning with dirty categorical variables. Machine Learning, 1477-1494. DOI: https://doi.org/10.1007/s10994-018-5724-2
  • Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2002). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Lawrence Erlbaum Associates Publishers. ISBN: 9780203774441
  • Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87. DOI: https://doi.org/10.1145/2347736.2347755
  • Feurer, M., Klein, A., Eggensperger, K., Springenberg, J. T., Blum, M., & Hutter, F. (2019). Auto-sklearn: Efficient and Robust Automated Machine Learning. In F. Hutter, L. Kotthoff, & J. Vanschoren (Eds.), Automated Machine Learning. The Springer Series on Challenges in Machine Learning (pp. 113-134). Springer, Cham. DOI: https://doi.org/10.1007/978-3-030-05318-5_6
  • Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly Media, Inc. ISBN: 9781492032649
  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 9, 249-256. Retrieved From: https://proceedings.mlr.press/v9/glorot10a.html
  • Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 15, 315-323. Retrieved From: https://proceedings.mlr.press/v15/glorot11a.html
  • Guo, C., & Berkhahn, F. (2016). Entity Embeddings of Categorical Variables. arXiv. Retrieved From: https://arxiv.org/abs/1604.06737
  • Hand, D. J., & Henley, W. E. (1997). Statistical Classification Methods in Consumer Credit Scoring: A Review. Journal of the Royal Statistical Society. Series A (Statistics in Society), 160(3), 523-541. DOI: https://doi.org/10.1111/j.1467-985X.1997.00078.x
  • Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Proceedings of the 31st International Conference on Neural Information Processing Systems, 3149–3157. Retrieved From: https://papers.nips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
  • Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations. San Diego, CA, USA. Retrieved From: https://arxiv.org/abs/1412.6980
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444. DOI: https://doi.org/10.1038/nature14539
  • Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. Proceedings of the 27th International Conference on Neural Information Processing Systems, 2177-2185. Retrieved From: https://papers.nips.cc/paper/2014/hash/feab05aa91085b7a8012516bc3533958-Abstract.html
  • Li, Y., & Yang, T. (2018). Word Embedding for Understanding Natural Language: A Survey. In S. Srinivasan (Ed.), Guide to Big Data Applications (pp. 83-104). Springer, Cham. DOI: https://doi.org/10.1007/978-3-319-53817-4_4
  • Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. SIGKDD Explor. Newsl., 3(1), 27-32. DOI: https://doi.org/10.1145/507533.507538
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of 1st International Conference on Learning Representations. Scottsdale, Arizona, USA. Retrieved From: https://arxiv.org/abs/1301.3781
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, 3111-3119. Retrieved From: https://papers.nips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html
  • Mnih, A., & Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. Proceedings of the 26th International Conference on Neural Information Processing Systems, 2265-2273. Retrieved From: https://proceedings.neurips.cc/paper/2013/hash/db2b4182156b2f1f817860ac9f409ad7-Abstract.html
  • Moeyersoms, J., & Martens, D. (2015). Including high-cardinality attributes in predictive models: A case study in churn prediction in the energy sector. Decision Support Systems, 72, 72-81. DOI: https://doi.org/10.1016/j.dss.2015.02.007
  • Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532-1543. DOI: https://doi.org/10.3115/v1/D14-1162
  • Perlich, C., & Provost, F. (2006). Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62(1), 65-105. DOI: https://doi.org/10.1007/s10994-006-6064-1
  • Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6639–6649. Retrieved From: https://arxiv.org/abs/1706.09516
  • Rahm, E., & Do, H.-H. (2000). Data Cleaning: Problems and Current Approaches. IEEE Data Engineering Bulletin, 23, 3-13. Retrieved From: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.98.8661
  • Rudolph, M., Ruiz, F. J., Mandt, S., & Blei, D. M. (2016). Exponential family embeddings. Proceedings of the 30th International Conference on Neural Information Processing Systems, 478-486. Retrieved From: https://papers.nips.cc/paper/2016/hash/06138bc5af6023646ede0e1f7c1eac75-Abstract.html
  • Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. DOI: https://doi.org/10.1038/323533a0
  • Russell, S., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson. ISBN: 0134610997
  • Suits, D. B. (1957). Use of Dummy Variables in Regression Equations. Journal of the American Statistical Association, 52(280), 548-551. DOI: https://doi.org/10.2307/2281705
  • Thomas, J., Coors, S., & Bischl, B. (2018). Automatic Gradient Boosting. arXiv. Retrieved From: https://arxiv.org/abs/1807.03873
  • Thornton, C., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2013). Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 847–855. DOI: https://doi.org/10.1145/2487575.2487629
  • Weinberger, K., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). Feature hashing for large scale multitask learning. Proceedings of the 26th Annual International Conference on Machine Learning, 1113–1120. DOI: https://doi.org/10.1145/1553374.1553516
  • Xu, W., Evans, D., & Qi, Y. (2018). Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. Proceedings of 2018 Network and Distributed System Security Symposium. Retrieved From: https://arxiv.org/abs/1704.01155
  • Yin, Z., & Shen, Y. (2018). On the dimensionality of word embedding. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 895-906. Retrieved From: https://arxiv.org/abs/1812.04224
There are 35 citations in total.

Details

Primary Language: English
Subjects: Artificial Intelligence, Engineering
Journal Section: Research Article
Authors: Mustafa Murat Arat (ORCID: 0000-0003-3740-5135)
Publication Date: June 23, 2022
Submission Date: October 25, 2021
Published in Issue: Year 2022

Cite

APA Arat, M. M. (2022). Learning From High-Cardinality Categorical Features in Deep Neural Networks. Journal of Advanced Research in Natural and Applied Sciences, 8(2), 222-236. https://doi.org/10.28979/jarnas.1014469
AMA Arat MM. Learning From High-Cardinality Categorical Features in Deep Neural Networks. JARNAS. June 2022;8(2):222-236. doi:10.28979/jarnas.1014469
Chicago Arat, Mustafa Murat. “Learning From High-Cardinality Categorical Features in Deep Neural Networks”. Journal of Advanced Research in Natural and Applied Sciences 8, no. 2 (June 2022): 222-36. https://doi.org/10.28979/jarnas.1014469.
EndNote Arat MM (June 1, 2022) Learning From High-Cardinality Categorical Features in Deep Neural Networks. Journal of Advanced Research in Natural and Applied Sciences 8 2 222–236.
IEEE M. M. Arat, “Learning From High-Cardinality Categorical Features in Deep Neural Networks”, JARNAS, vol. 8, no. 2, pp. 222–236, 2022, doi: 10.28979/jarnas.1014469.
ISNAD Arat, Mustafa Murat. “Learning From High-Cardinality Categorical Features in Deep Neural Networks”. Journal of Advanced Research in Natural and Applied Sciences 8/2 (June 2022), 222-236. https://doi.org/10.28979/jarnas.1014469.
JAMA Arat MM. Learning From High-Cardinality Categorical Features in Deep Neural Networks. JARNAS. 2022;8:222–236.
MLA Arat, Mustafa Murat. “Learning From High-Cardinality Categorical Features in Deep Neural Networks”. Journal of Advanced Research in Natural and Applied Sciences, vol. 8, no. 2, 2022, pp. 222-36, doi:10.28979/jarnas.1014469.
Vancouver Arat MM. Learning From High-Cardinality Categorical Features in Deep Neural Networks. JARNAS. 2022;8(2):222-36.




JARNAS is licensed under a Creative Commons Attribution-NonCommercial 4.0 International Licence (CC BY-NC).