A New Feature Selection Metric Based on Rough Sets and Information Gain in Text Classification

Rasim Çekik; Mahmut Kaya

doi:10.54287/gujsa.1379024

Research Article

Year 2023, Volume: 10 Issue: 4, 472 - 486, 31.12.2023

Cited By: 3

Abstract

In text classification, treating the words in text documents as features creates a very high-dimensional feature space. This is known as the high-dimensionality problem in text classification. The most common and effective way to address it is to select an ideal subset of features using a feature selection approach. This paper presents a new feature selection approach, Rough Information Gain (RIG), as a solution to the high-dimensionality problem. RIG uses Rough Sets to extract hidden and meaningful patterns from text data and computes a score value based on these patterns. When pattern extraction is completely uncertain, the proposed approach falls back on the selection strategy of the Information Gain Selection (IG) approach. In the experimental studies, RIG is compared with the Information Gain Selection (IG), Chi-Square (CHI2), Gini Coefficient (GI), and Discriminative Feature Selector (DFS) approaches using the Micro-F1 success metric. According to the results, RIG outperforms the other methods.
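
To make two of the ingredients named in the abstract concrete, the following is a minimal Python sketch of the textbook Information Gain score for a single term (class entropy minus class entropy conditioned on term presence) and of the Micro-F1 metric. It is not the RIG scoring described in the paper, whose formula is not given in this abstract. The function names, the toy documents, and the labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Textbook IG of a term: H(C) - H(C | term present or absent)."""
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes first."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy corpus: each document is a set of its words.
docs = [{"free", "prize"}, {"free", "offer"},
        {"meeting", "today"}, {"lunch", "today"}]
labels = ["spam", "spam", "ham", "ham"]

# Rank the vocabulary by IG and keep the two highest-scoring terms.
vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary,
                key=lambda t: information_gain(docs, labels, t),
                reverse=True)
selected = ranked[:2]
```

On this toy corpus, "free" and "today" each split the two classes perfectly, so both receive the maximum score of 1 bit and are the two terms selected. For single-label predictions, Micro-F1 coincides with accuracy, which is a quick sanity check on `micro_f1`.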

Keywords

Feature Selection, Text Classification, Rough Set, Dimensionality Reduction

The article lists 31 references in total.

Details

Primary Language	English
Subjects	Artificial Intelligence (Other)
Journal Section	Research Article
Authors	Rasim Çekik (ORCID: 0000-0002-7820-413X); Mahmut Kaya (ORCID: 0000-0002-7846-1769)
Submission Date	October 20, 2023
Acceptance Date	November 17, 2023
Early Pub Date	December 12, 2023
Publication Date	December 31, 2023
Published in Issue	Year 2023 Volume: 10 Issue: 4

Cite

APA	Çekik, R., & Kaya, M. (2023). A New Feature Selection Metric Based on Rough Sets and Information Gain in Text Classification. Gazi University Journal of Science Part A: Engineering and Innovation, 10(4), 472-486. https://doi.org/10.54287/gujsa.1379024

Cited By

ANDClust: An Adaptive Neighborhood Distance‐Based Clustering Algorithm to Cluster Varying Density and/or Neck‐Typed Datasets

Advanced Theory and Simulations

https://doi.org/10.1002/adts.202301113

A RULE-BASED APPROACH USING THE ROUGH SET ON COVID-19 DATA

Eskişehir Osmangazi Üniversitesi Mühendislik ve Mimarlık Fakültesi Dergisi

https://doi.org/10.31796/ogummf.1420509

Effective Text Classification Through Supervised Rough Set-Based Term Weighting

Symmetry

https://doi.org/10.3390/sym17010090
