Text classification by machine learning algorithms using a new text feature extraction method based on image processing

Ahmet Çelik; Deniz Kaptan

doi:10.31127/tuje.1718023

Research Article

Year 2025, Volume: 9 Issue: 4, 712 - 724, 08.10.2025

Abstract

Accurate recognition of text and characters in documents using smart technologies is an important means of obtaining data. Complex and irregular text and characters in images, together with differing writing styles, affect the recognition success of both Artificial Intelligence (AI) and Machine Learning (ML) technologies. Manually transferring text and characters from paper documents to digital media wastes considerable time and labor. In addition, when documents containing text are scanned and transferred to a computer, the text cannot be edited. Optical Character Recognition (OCR) methods, proposed as a solution to this problem, are one of the Natural Language Processing (NLP) tasks. In particular, even current AI-based OCR software has been observed to confuse the characters 0 and O. This study proposes applying image pre-processing to images containing characters in order to increase character recognition success. A new model was designed to improve the correct recognition of the characters 0 and O, which closely resemble each other. Image pre-processing was applied to images of 408 characters, and classification success on the resulting data set was measured using the kNN, SVM, and Logistic Regression (LR) algorithms. The classification performance for the 0 and O characters was also measured with the AI-based Google Documents tool. According to the results, the LR algorithm recognized the 0 and O characters with a score of 1.00 on the performance metrics.
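The pipeline summarized in the abstract (pre-process the character image, extract features, classify, score) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the article: the synthetic ring-shaped glyphs produced by `make_glyph`, the fixed threshold of 128, the two features (bounding-box aspect ratio and ink density), and the small hand-written kNN with k = 3. The authors' actual pre-processing steps, features, and 408-image data set are not described in the abstract, so the numbers this toy example produces are not comparable to the reported results.

```python
import math
import random


def make_glyph(kind, rng, size=24):
    """Render a noisy grayscale ring: narrow for '0', rounder for 'O' (synthetic stand-in)."""
    ry = size * 0.40                                # vertical radius
    rx = ry * (0.60 if kind == "0" else 0.95)       # horizontal radius (assumed shapes)
    rx *= rng.uniform(0.95, 1.05)                   # small shape jitter
    c = (size - 1) / 2
    img = []
    for y in range(size):
        row = []
        for x in range(size):
            d = math.hypot((x - c) / rx, (y - c) / ry)  # normalized elliptical distance
            base = 40 if 0.75 <= d <= 1.0 else 220      # dark stroke on light paper
            row.append(base + rng.randint(-30, 30))     # additive noise
        img.append(row)
    return img


def preprocess(img, threshold=128):
    """Binarize (ink = 1) and crop to the bounding box of the ink pixels."""
    binary = [[1 if p < threshold else 0 for p in row] for row in img]
    rows = [i for i, r in enumerate(binary) if any(r)]
    cols = [j for j in range(len(binary[0])) if any(r[j] for r in binary)]
    return [r[cols[0]:cols[-1] + 1] for r in binary[rows[0]:rows[-1] + 1]]


def features(binary):
    """Two illustrative features: bounding-box aspect ratio and ink density."""
    h, w = len(binary), len(binary[0])
    ink = sum(map(sum, binary))
    return (w / h, ink / (w * h))


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples nearest in Euclidean distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def evaluate(y_true, y_pred, positive="0"):
    """Accuracy, precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def run_demo(seed=0, n_per_class=40):
    """Build a stratified 75/25 split of synthetic glyphs and score the kNN sketch."""
    rng = random.Random(seed)
    train, test = [], []
    for kind in "0O":
        samples = [(features(preprocess(make_glyph(kind, rng))), kind)
                   for _ in range(n_per_class)]
        cut = int(0.75 * n_per_class)
        train += samples[:cut]
        test += samples[cut:]
    y_true = [label for _, label in test]
    y_pred = [knn_predict(train, x) for x, _ in test]
    return evaluate(y_true, y_pred)


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is the ordering of the stages: cropping to the bounding box before measuring shape is what makes a simple geometric feature such as aspect ratio usable for telling a narrow 0 from a rounder O. Real scanned characters vary far more than these synthetic rings, so a working system would need richer features and a classifier tuned on real data.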

Keywords

Artificial intelligence, Machine learning, Image processing, Feature extraction, Character recognition, Digital data processing

Details

Primary Language	English
Subjects	Information Systems (Other)
Journal Section	Articles
Authors	Ahmet Çelik (ORCID 0000-0002-6288-3182), Deniz Kaptan (ORCID 0000-0002-6055-5038)
Publication Date	October 8, 2025
Submission Date	June 12, 2025
Acceptance Date	September 17, 2025
Published in Issue	Year 2025 Volume: 9 Issue: 4

Cite

APA	Çelik, A., & Kaptan, D. (2025). Text classification by machine learning algorithms using a new text feature extraction method based on image processing. Turkish Journal of Engineering, 9(4), 712-724. https://doi.org/10.31127/tuje.1718023
