estuscience - se

Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering

2667-4211

Eskisehir Technical University

10.18038/estubtda.1784468

Supervised Learning; Classification Algorithms; Natural Language Processing

NOVEL TERM WEIGHTING METHODS FOR TEXT CLASSIFICATION BASED ON ECONOMIC INEQUALITY METRICS

https://orcid.org/0000-0002-2137-5253

Murat Okkalıoğlu

Yalova University

03 27 2026

Vol. 27, No. 1, pp. 125-148; dates: 09/15/2025, 03/17/2026

2000

Term weighting plays a critical role in text classification. Traditional methods, with a few exceptions, make limited or inadequate use of the distributional characteristics of terms across classes. The core hypothesis of this study is that a term’s weight should be proportional to the unevenness of its distribution across classes. The proposed methods therefore prioritize terms concentrated in one or a few classes over terms distributed almost evenly across all classes. To implement this idea, we introduce a family of novel term weighting methods based on economic inequality metrics. These metrics are typically used to measure the unfairness of income distribution in a population; we adapt them to characterize term distributions. To quantify distributional unevenness as a measure of term significance, we select one representative method from each of three major categories of inequality indices: Lorenz curve-based (Schutz), entropy-based (Theil, with two variants), and social welfare-based (Atkinson). Experiments were conducted on four benchmark datasets (20NG, R8, R52, and WebKB) with two classifiers (Multinomial Naïve Bayes and Support Vector Machines), evaluated by micro-F1 and macro-F1. The results show that the proposed term weighting methods, particularly the one based on the Schutz index, consistently achieve superior or highly competitive performance compared with both traditional and state-of-the-art term weighting approaches. These findings support the validity of exploiting economic inequality principles to quantify the inter-class distributional characteristics of terms. This work thus not only validates the effectiveness of the proposed methods but also demonstrates the value of interdisciplinary work in the term weighting literature.

Economic inequality; Class distribution; Term weighting; Text classification
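
As a rough illustration of the idea in the abstract, the sketch below computes three textbook inequality indices over a term's per-class counts: the Schutz (Hoover) index, the Theil T index, and the Atkinson index. It is a minimal sketch under stated assumptions, not the paper's weighting scheme: the function names, the example counts, and the Atkinson parameter `epsilon=0.5` are all hypothetical, and the abstract does not say how each index is combined with term frequency.

```python
import math

def schutz_index(counts):
    """Schutz (Hoover) index: half the relative mean absolute deviation."""
    total = sum(counts)
    if total == 0:
        return 0.0
    mean = total / len(counts)
    return 0.5 * sum(abs(c - mean) for c in counts) / total

def theil_t_index(counts):
    """Theil T index; zero counts contribute 0 (limit of x*ln(x) as x -> 0)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    mean = total / len(counts)
    return sum((c / mean) * math.log(c / mean) for c in counts if c > 0) / len(counts)

def atkinson_index(counts, epsilon=0.5):
    """Atkinson index for 0 < epsilon < 1 (epsilon value is an assumption)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    n = len(counts)
    mean = total / n
    power = 1.0 - epsilon
    # Equally distributed equivalent: the generalized mean of order (1 - epsilon).
    ede = (sum(c ** power for c in counts) / n) ** (1.0 / power)
    return 1.0 - ede / mean

# Hypothetical per-class document counts for two terms over four classes.
concentrated = [90, 5, 3, 2]   # term appears mostly in one class
even = [25, 25, 25, 25]        # term spread evenly across classes

for name, counts in [("concentrated", concentrated), ("even", even)]:
    print(name, schutz_index(counts), theil_t_index(counts), atkinson_index(counts))
```

All three indices are zero for the evenly spread term and grow as the term concentrates in fewer classes, which is the property the hypothesis relies on when it favors class-concentrated terms.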
