A New Feature Selection Metric Based on Rough Sets and Information Gain in Text Classification
Abstract
In text classification, treating the words in documents as features produces a very high-dimensional feature space. This is known as the high dimensionality problem. The most common and effective remedy is to select a suitable subset of features with a feature selection method. This paper presents a new feature selection approach, Rough Information Gain (RIG), as a solution to the high dimensionality problem. RIG uses Rough Sets to extract hidden, meaningful patterns from text data and computes a score value based on these patterns. When pattern extraction is completely uncertain, the proposed approach falls back on the selection strategy of Information Gain (IG). In the experimental studies, RIG is compared with the Information Gain (IG), Chi-Square (CHI2), Gini Coefficient (GI), and Discriminative Feature Selector (DFS) approaches using the Micro-F1 measure. According to the results, RIG outperforms the other methods.
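The abstract does not give the RIG formula, so the following is only a minimal sketch of the two standard ingredients it names: the textbook Information Gain of a binary term-presence feature, and the rough-set lower and upper approximations of a class under the partition a single term induces on the documents. The function names, the toy corpus, and the choice of term presence as the only attribute are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(term_present, labels):
    """Textbook IG of a binary term-presence feature with respect to the class."""
    total = len(labels)
    with_term = [y for t, y in zip(term_present, labels) if t]
    without_term = [y for t, y in zip(term_present, labels) if not t]
    conditional = (
        (len(with_term) / total) * entropy(with_term)
        + (len(without_term) / total) * entropy(without_term)
    )
    return entropy(labels) - conditional


def rough_approximations(term_present, labels, target_class):
    """Lower and upper approximation of one class, using the two-block
    partition (term present / term absent) as the indiscernibility relation.
    This single-term partition is an assumption made for illustration."""
    target = {i for i, y in enumerate(labels) if y == target_class}
    blocks = [
        {i for i, t in enumerate(term_present) if t},
        {i for i, t in enumerate(term_present) if not t},
    ]
    lower, upper = set(), set()
    for block in blocks:
        if block and block <= target:  # block lies entirely inside the class
            lower |= block
        if block & target:  # block overlaps the class
            upper |= block
    return lower, upper


# Toy corpus (made up): six documents, two classes.
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
perfect_term = [1, 1, 1, 0, 0, 0]  # occurs in every spam document only
noisy_term = [1, 0, 1, 0, 1, 0]    # occurs in both classes

print(information_gain(perfect_term, labels))
print(information_gain(noisy_term, labels))
print(rough_approximations(perfect_term, labels, "spam"))
print(rough_approximations(noisy_term, labels, "spam"))
```

In this toy data the perfectly discriminating term yields identical lower and upper approximations, while the noisy term yields an empty lower approximation, so the rough-set view alone says nothing certain about the class. How RIG combines the two quantities in such a case is not stated in the abstract.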
Details
Primary Language
English
Subjects
Artificial Intelligence (Other)
Journal Section
Research Article
Early Pub Date
December 12, 2023
Publication Date
December 31, 2023
Submission Date
October 20, 2023
Acceptance Date
November 17, 2023
Published in Issue
Year 2023 Volume: 10 Number: 4
APA
Çekik, R., & Kaya, M. (2023). A New Feature Selection Metric Based on Rough Sets and Information Gain in Text Classification. Gazi University Journal of Science Part A: Engineering and Innovation, 10(4), 472-486. https://doi.org/10.54287/gujsa.1379024
Cited By
ANDClust: An Adaptive Neighborhood Distance‐Based Clustering Algorithm to Cluster Varying Density and/or Neck‐Typed Datasets
Advanced Theory and Simulations
https://doi.org/10.1002/adts.202301113
A RULE-BASED APPROACH USING THE ROUGH SET ON COVID-19 DATA
Eskişehir Osmangazi Üniversitesi Mühendislik ve Mimarlık Fakültesi Dergisi
https://doi.org/10.31796/ogummf.1420509