Research Article

The Role of Feature Weighting Methods on Local Feature Selection Methods for Text Classification

Year 2022, Volume 9, Issue 2, 672 - 682, 31.12.2022
https://doi.org/10.35193/bseufbd.993833

Abstract

The growth of internet technologies has brought a significant increase in textual data, and automatic text classification has become important for making sense of this data. Feature selection and feature weighting play a central role in automatic text classification. This study examines in detail the effect of feature weighting methods on local feature selection methods. Two weighting methods, three local feature selection methods, three benchmark datasets, and two classifiers were used. The highest Micro-F1 and Macro-F1 scores were 92.88 and 65.55 for the Reuters-21578 dataset, 99.02 and 98.15 for the 20Newsgroup dataset, and 97.19 and 93.40 for the Enron1 dataset. Experimental results show that better results are obtained by combining the Odds Ratio (OR) feature selection method, Term Frequency (TF) feature weighting, and the Support Vector Machine (SVM) classifier.
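As a rough illustration of the pipeline named in the abstract, the sketch below scores terms for a single class with a smoothed log odds ratio and then builds raw term-frequency vectors over the selected terms. It is a minimal sketch under stated assumptions, not the study's implementation: the abstract does not give the exact OR formula, so a commonly used log-odds form is assumed, and the smoothing constant `eps`, the helper names `odds_ratio_scores` and `tf_vector`, and the toy documents are all hypothetical. The classifier step (SVM) is omitted.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels, target, eps=0.5):
    """Score each term for one class with a smoothed log odds ratio.

    Assumed form: log( P(t|c) * (1 - P(t|not c)) / ((1 - P(t|c)) * P(t|not c)) ),
    with probabilities estimated from document frequencies.
    The smoothing constant `eps` is an assumption that avoids division by zero.
    """
    pos = [set(d) for d, y in zip(docs, labels) if y == target]
    neg = [set(d) for d, y in zip(docs, labels) if y != target]
    vocab = set().union(*pos, *neg)
    scores = {}
    for term in vocab:
        p_pos = (sum(term in d for d in pos) + eps) / (len(pos) + 2 * eps)
        p_neg = (sum(term in d for d in neg) + eps) / (len(neg) + 2 * eps)
        scores[term] = math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))
    return scores


def tf_vector(doc, selected):
    """Raw term-frequency weights of a tokenized document over the selected terms."""
    counts = Counter(doc)
    return [counts[t] for t in selected]


# Hypothetical toy corpus; the study itself uses Reuters-21578, 20Newsgroup, and Enron1.
docs = [
    ["cheap", "pills", "cheap", "offer"],
    ["meeting", "schedule", "offer"],
    ["cheap", "offer", "win"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "ham", "spam", "ham"]

scores = odds_ratio_scores(docs, labels, "spam")
# Rank terms for the "spam" class only: the score is class-specific (local).
top_terms = sorted(scores, key=lambda t: (-scores[t], t))[:2]
vector = tf_vector(docs[0], ["cheap", "offer"])
```

Because the scores are computed per class, each class gets its own ranking of terms. Vectors like `vector` would then be passed to a classifier such as an SVM.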



Details

Primary Language: Turkish
Subjects: Engineering
Section: Articles
Authors

Bekir Parlak 0000-0001-8919-6481

Publication Date: 31 December 2022
Submission Date: 10 September 2021
Acceptance Date: 15 August 2022
Published in Issue: Year 2022

Cite

APA Parlak, B. (2022). Metin Sınıflandırma için Öznitelik Ağırlıklandırma Metotlarının Lokal Öznitelik Seçim Metotları Üzerindeki Rolü. Bilecik Şeyh Edebali Üniversitesi Fen Bilimleri Dergisi, 9(2), 672-682. https://doi.org/10.35193/bseufbd.993833