TY  - JOUR
T1  - Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data
AU  - Ataseven, Hüseyin
AU  - Çokluk-Bökeoglu, Ömay
PY  - 2025
DA  - September
Y2  - 2025
DO  - 10.35377/saucis...1626239
JF  - Sakarya University Journal of Computer and Information Sciences
JO  - SAUCIS
PB  - Sakarya University
WT  - DergiPark
SN  - 2636-8129
SP  - 422
EP  - 440
VL  - 8
IS  - 3
LA  - en
AB  - This study compares the classification accuracy of text mining algorithms for foreign language proficiency exam items. The dataset included 2,868 items from ÜDS English tests (2006–2012) across Natural and Applied Sciences (n=956), Health Sciences (n=956), and Social Sciences (n=956). Algorithms tested were k-Nearest Neighbors (kNN), Naïve Bayes (NB), Naïve Bayes-Kernel (NB-K), Random Forest (RF), and Support Vector Machines (SVM). Binary classification accuracies ranged from 83.08% (NB) to 92.48% (SVM), while multiclass accuracies ranged from 71.93% (NB) to 84.96% (kNN). Expert analysis and cross-validation identified class-inconsistent items that negatively affected accuracy. Removing these items improved binary classification by 7.39%–9.83% and multiclass classification by 10.58%–17.89%. Among algorithms, kNN was least impacted by class-inconsistent data. These findings highlight the importance of addressing inconsistencies for improving algorithmic performance, with kNN showing robust results across scenarios.
KW  - Text mining
KW  - Document classification
KW  - Class-inconsistent data
KW  - Robustness of classification algorithms
UR  - https://doi.org/10.35377/saucis...1626239
L1  - https://dergipark.org.tr/en/download/article-file/4549073
ER  - 