Research Article

The Effectiveness of Feature Selection Methods for Imbalanced Text Classification

Year 2023, Volume: 23, Issue: 2, pp. 370-379, May 3, 2023
https://doi.org/10.35414/akufemubid.1172637

Abstract

Text data are often distributed unevenly across classes, and this imbalance degrades the performance
of classifiers in text classification. Imbalanced text classification has been studied extensively.
Feature selection, one of the key stages of the text classification process, is also critical in the
imbalanced setting. This study investigates the effect of feature selection methods on the
classification of imbalanced texts. To this end, experiments were carried out on two datasets with
three classifiers and nine feature selection methods, and each method was also evaluated with
different numbers of selected features. The nine feature selection methods evaluated are NDM, DFSS,
PFS, POISSON, CHI2, IG, GINI, DFS and MDFS. Experimental results were obtained with Support Vector
Machine (SVM), Decision Tree (DTREE), and Naïve Bayes (MNB) classifiers. On the Reuters-21578
dataset, the DFS and CHI2 feature selection methods achieved the highest Macro-F1 score,
approximately 80. On the SPAM SMS dataset, DFS achieved the highest Macro-F1 score of 95, followed
by CHI2 with 94. These results indicate that DFS and CHI2 are more successful than the other
evaluated methods for imbalanced text classification.
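
As a rough illustration of two ideas named in the abstract, the sketch below scores terms with a chi-square statistic (the general idea behind CHI2) and computes Macro-F1 on an imbalanced label set. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy corpus, the function names `chi2_score`, `top_k_terms`, and `macro_f1`, and the one-vs-rest 2x2 contingency formulation are all illustrative choices, and the study's exact formulas and preprocessing may differ.

```python
# Hypothetical toy corpus: each document is a set of tokens (2 spam, 6 ham).
docs = [
    {"free", "prize", "call"}, {"free", "win", "now"},
    {"meeting", "tomorrow"}, {"call", "me", "later"},
    {"lunch", "tomorrow"}, {"see", "you", "later"},
    {"home", "now"}, {"ok", "thanks"},
]
labels = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]


def chi2_score(docs, labels, term, target):
    """Chi-square statistic of the 2x2 table (term present/absent) x (target/other)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present, in_class = term in tokens, label == target
        if present and in_class:
            a += 1  # term present, target class
        elif present:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def top_k_terms(docs, labels, target, k):
    """Rank the vocabulary by chi-square score and keep the k best terms."""
    vocab = sorted({t for doc in docs for t in doc})
    ranked = sorted(vocab, key=lambda t: chi2_score(docs, labels, t, target),
                    reverse=True)
    return ranked[:k]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(0.0 if denom == 0 else 2 * tp / denom)
    return sum(f1s) / len(f1s)


if __name__ == "__main__":
    print(top_k_terms(docs, labels, "spam", 3))
    # A classifier that always predicts the majority class looks fine on
    # accuracy (6 of 8 correct) but is penalized heavily by Macro-F1.
    always_ham = ["ham"] * len(labels)
    print(macro_f1(labels, always_ham))
```

Macro-F1 averages the per-class F1 scores without weighting by class size, which is why it is a common choice when the minority class matters as much as the majority class.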

References

  • Chang, C.-C. and C.-J. Lin, LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2011. 2(3): p. 1-27.
  • Chen, L., L. Jiang, and C. Li, Modified DFS-based term weighting scheme for text classification. Expert Systems with Applications, 2021. 168: p. 114438.
  • Chen, X.-w. and M. Wasikowski. Fast: a roc-based feature selection metric for small samples and imbalanced data classification problems. in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 2008.
  • Chen, Y.-T. and M.C. Chen, Using chi-square statistics to measure similarities for text categorization. Expert systems with applications, 2011. 38(4): p. 3085-3090.
  • Forman, G., An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res., 2003. 3(Mar): p. 1289-1305.
  • Galar, M., et al., A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2011. 42(4): p. 463-484.
  • He, H. and E.A. Garcia, Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 2009. 21(9): p. 1263-1284.
  • Liu, H. and L. Yu, Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on knowledge and data engineering, 2005. 17(4): p. 491-502.
  • Maldonado, S., R. Weber, and F. Famili, Feature selection for high-dimensional class-imbalanced data sets using support vector machines. Information sciences, 2014. 286: p. 228-246.
  • Moayedikia, A., et al., Feature selection for high dimensional imbalanced class data using harmony search. Engineering Applications of Artificial Intelligence, 2017. 57: p. 38-49.
  • Ogura, H., H. Amano, and M. Kondo, Feature selection with a measure of deviations from Poisson in text categorization. Expert Systems with Applications, 2009. 36(3): p. 6826-6832.
  • Ogura, H., H. Amano, and M. Kondo, Comparison of metrics for feature selection in imbalanced text classification. Expert Systems with Applications, 2011. 38(5): p. 4978-4989.
  • Pouramini, J., B. Minaei-Bidgoli, and M. Esmaeili, A novel feature selection method in the categorization of imbalanced textual data. KSII Transactions on Internet and Information Systems (TIIS), 2018. 12(8): p. 3725-3748.
  • Quinlan, J.R., Induction of decision trees. Machine learning, 1986. 1(1): p. 81-106.
  • Rehman, A., K. Javed, and H.A. Babri, Feature selection based on a normalized difference measure for text classification. Information Processing & Management, 2017. 53(2): p. 473-489.
  • Schütze, H., C.D. Manning, and P. Raghavan, Introduction to information retrieval. Vol. 39. 2008: Cambridge University Press Cambridge.
  • Shang, W., et al., A novel feature selection algorithm for text categorization. Expert Systems with Applications, 2007. 33(1): p. 1-5.
  • Uysal, A.K. and S. Gunal, A novel probabilistic feature selection method for text classification. Knowledge-Based Systems, 2012. 36: p. 226-235.
  • Uysal, A.K., et al. Detection of SMS spam messages on mobile phones. in 2012 20th Signal Processing and Communications Applications Conference (SIU). 2012. IEEE.
  • Witten, I.H. and E. Frank, Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record, 2002. 31(1): p. 76-77.
  • Zong, W., et al., A discriminative and semantic feature selection method for text categorization. International Journal of Production Economics, 2015. 165: p. 215-222.

Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği




Details

Primary Language: Turkish
Subjects: Computer Software
Journal Section: Articles
Authors:

Hande Tiryaki (ORCID: 0000-0002-1533-6901)

Alper Kürşat Uysal (ORCID: 0000-0002-4057-934X)

Early Pub Date: April 28, 2023
Publication Date: May 3, 2023
Submission Date: September 10, 2022
Published in Issue: Year 2023, Volume: 23, Issue: 2

Cite

APA Tiryaki, H., & Uysal, A. K. (2023). Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, 23(2), 370-379. https://doi.org/10.35414/akufemubid.1172637
AMA Tiryaki H, Uysal AK. Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. May 2023;23(2):370-379. doi:10.35414/akufemubid.1172637
Chicago Tiryaki, Hande, and Alper Kürşat Uysal. “Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 23, no. 2 (May 2023): 370-79. https://doi.org/10.35414/akufemubid.1172637.
EndNote Tiryaki H, Uysal AK (May 1, 2023) Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 23 2 370–379.
IEEE H. Tiryaki and A. K. Uysal, “Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği”, Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, vol. 23, no. 2, pp. 370–379, 2023, doi: 10.35414/akufemubid.1172637.
ISNAD Tiryaki, Hande - Uysal, Alper Kürşat. “Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 23/2 (May 2023), 370-379. https://doi.org/10.35414/akufemubid.1172637.
JAMA Tiryaki H, Uysal AK. Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. 2023;23:370–379.
MLA Tiryaki, Hande and Alper Kürşat Uysal. “Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, vol. 23, no. 2, 2023, pp. 370-9, doi:10.35414/akufemubid.1172637.
Vancouver Tiryaki H, Uysal AK. Dengesiz Metin Sınıflandırmada Öznitelik Seçim Yöntemlerinin Etkililiği. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. 2023;23(2):370-9.