Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data
Abstract
This study compares the classification accuracy of text mining algorithms on foreign language proficiency exam items. The dataset comprised 2,868 items from ÜDS English tests (2006–2012), divided equally among Natural and Applied Sciences (n = 956), Health Sciences (n = 956), and Social Sciences (n = 956). The algorithms tested were k-Nearest Neighbors (kNN), Naïve Bayes (NB), Naïve Bayes-Kernel (NB-K), Random Forest (RF), and Support Vector Machines (SVM). Binary classification accuracy ranged from 83.08% (NB) to 92.48% (SVM), and multiclass accuracy ranged from 71.93% (NB) to 84.96% (kNN). Expert analysis and cross-validation identified class-inconsistent items that lowered accuracy. Removing these items improved binary classification accuracy by 7.39%–9.83% and multiclass accuracy by 10.58%–17.89%. Among the algorithms, kNN was the least affected by class-inconsistent data. These findings underscore the importance of resolving class inconsistencies to improve algorithmic performance and indicate that kNN is comparatively robust across scenarios.
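As a rough illustration of how cross-validation can surface class-inconsistent items, the sketch below runs leave-one-out kNN on a tiny invented corpus and flags items whose predicted class disagrees with their assigned class. It is not the authors' pipeline: the items, the labels, the bag-of-words cosine similarity, the choice of k = 3, and all function names are assumptions made for the example, and the accuracies it prints say nothing about the figures reported in the abstract.

```python
from collections import Counter
import math

# Invented toy items (text, assigned class); NOT taken from the ÜDS dataset.
ITEMS = [
    ("patient received drug dose", "health"),
    ("drug reduced patient fever", "health"),
    ("clinical trial tested vaccine dose", "health"),
    ("vaccine protected patient trial", "health"),
    ("election changed government policy", "social"),
    ("voters judged election policy", "social"),
    ("society debated election law", "social"),
    ("court reviewed election law", "social"),
    # Deliberately class-inconsistent: health vocabulary, "social" label.
    ("patient received vaccine dose", "social"),
]

def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized text."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def loo_knn_predictions(items, k=3):
    """Predict each item's class from its k nearest neighbors, leaving it out."""
    vectors = [vectorize(text) for text, _ in items]
    predictions = []
    for i, query in enumerate(vectors):
        others = [j for j in range(len(items)) if j != i]
        others.sort(key=lambda j: cosine(query, vectors[j]), reverse=True)
        votes = Counter(items[j][1] for j in others[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions

def accuracy(items, predictions):
    """Share of items whose predicted class matches the assigned class."""
    hits = sum(pred == label for (_, label), pred in zip(items, predictions))
    return hits / len(items)

predictions = loo_knn_predictions(ITEMS)
# Items whose held-out prediction disagrees with the assigned class.
flagged = [i for i, ((_, label), pred) in enumerate(zip(ITEMS, predictions))
           if pred != label]
cleaned = [item for i, item in enumerate(ITEMS) if i not in flagged]

print("flagged indices:", flagged)
print("accuracy before removal:", round(accuracy(ITEMS, predictions), 3))
print("accuracy after removal:",
      round(accuracy(cleaned, loo_knn_predictions(cleaned)), 3))
```

In the study, flagged items were also examined by experts; a disagreement between the held-out prediction and the assigned class is only a candidate signal, not proof that the item is mislabeled.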
Details
Primary Language
English
Subjects
Software Engineering (Other)
Journal Section
Research Article
Authors
Hüseyin Ataseven, Ömay Çokluk-Bökeoglu
Early Pub Date
September 24, 2025
Publication Date
September 30, 2025
Submission Date
January 24, 2025
Acceptance Date
July 16, 2025
Published in Issue
Year: 2025, Volume: 8, Number: 3
APA
Ataseven, H., & Çokluk-Bökeoglu, Ö. (2025). Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data. Sakarya University Journal of Computer and Information Sciences, 8(3), 422-440. https://doi.org/10.35377/saucis...1626239
