Stylometric Profiling of Turkish Texts: Joint Estimation of Author, Region, Age and Genre

Vecdi Emre Levent; Uğur Özbalkan

doi:10.29130/dubited.1728460

Review Article

Türkçe Metinlerde Stilometrik Profilleme: Yazar, Bölge, Yaş ve Türün Eşzamanlı Tahmini

Year 2026, Volume: 14 Issue: 1, 288 - 298, 21.01.2026

Abstract

Every author has distinctive writing (style) characteristics. These characteristics can vary with region, age, content type, and similar factors. In some cases it is important to determine who wrote a text, which age range the author falls into, or which region has influenced the writing. Authorship identification is widely used, particularly in plagiarism detection, to determine whether an author's writing deviates from its expected characteristics.

In this study, the authorship characteristics of newspaper columnists were evaluated using Artificial Neural Networks, Support Vector Machines, and decision tree algorithms (J48 and Random Forest). Sixteen features were used as authorship indicators. Six datasets were built for the experiments and the results were evaluated. Accuracy was 73% for author identification by region, 55% by news genre, and 62.5% by age. The developed application saves the feature sets generated for different numbers of classes in ARFF file format and performs author recognition using Artificial Neural Networks.
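To illustrate what an ARFF export of stylometric feature vectors can look like, the following is a minimal sketch and not the authors' application. The three features computed here (`avg_word_length`, `avg_sentence_length`, `type_token_ratio`), the relation name `columnists`, the class labels, and the two sample sentences are all assumptions made for illustration; the abstract does not list the sixteen features actually used.

```python
import re


def stylometric_features(text: str) -> dict:
    """Compute three illustrative features (hypothetical, not the paper's sixteen)."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ is Unicode-aware for str patterns, so Turkish letters are kept.
    words = re.findall(r"\w+", text.lower())
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
    }


def to_arff(relation: str, rows: list, classes: list) -> str:
    """Serialize (feature dict, class label) pairs as ARFF text."""
    names = list(rows[0][0].keys())
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per feature, then a nominal class attribute.
    lines += [f"@attribute {name} numeric" for name in names]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for feats, label in rows:
        values = ",".join(f"{feats[name]:.4f}" for name in names)
        lines.append(f"{values},{label}")
    return "\n".join(lines) + "\n"


# Made-up sample texts and labels, for illustration only.
samples = [
    ("Bugün hava çok güzel. Sahilde yürüdük.", "author_a"),
    ("Ekonomi büyüyor mu? Veriler öyle söylüyor. Bekleyip göreceğiz.", "author_b"),
]
rows = [(stylometric_features(text), label) for text, label in samples]
arff_text = to_arff("columnists", rows, ["author_a", "author_b"])
print(arff_text)
```

A file written this way can be opened in Weka, where classifiers such as J48 or Random Forest can be run on it. Changing the number of classes only changes the nominal `class` attribute and the labels in the data rows.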

Keywords

Author Recognition, Stylometry, Artificial Neural Network, Support Vector Machine, Decision Trees, Authorship Features

Abstract

Authorship identification seeks to determine the writer of a text by analyzing distinctive linguistic and stylistic features. These characteristics may vary across dimensions such as region, age, and genre. Identifying an author’s stylistic fingerprint is essential in plagiarism detection, digital forensics, and computational linguistics. In this study, the authorship features of Turkish columnists were analyzed using Artificial Neural Networks (ANN), Support Vector Machines (SVM), and decision tree algorithms (J48 and Random Forest). Sixteen stylometric indicators were selected through the Zemberek natural language processing library and evaluated across six distinct datasets. The proposed system allows flexible parameter adjustment through a graphical interface and exports results in ARFF format for reproducibility. Experimental results demonstrated that Random Forest achieved the highest overall accuracy, particularly in regional and age-based datasets, with F-measures reaching up to 0.91. The accuracy rates were 73% for regional classification, 55% for genre classification, and 62.5% for age-based classification. The findings confirm that combining statistical learning with stylometric analysis provides a robust framework for Turkish authorship attribution, paving the way for future studies employing deep learning and transformer-based models.
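The abstract reports both accuracy and F-measure. As a small worked sketch of how the two metrics are computed from a confusion matrix, the code below uses a made-up 3-class matrix; the counts are hypothetical and are not results from the study. It also shows that the best per-class F-measure can be higher than overall accuracy. Whether that explains the figures reported above is not stated in the abstract.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(confusion: list) -> float:
    """Fraction of all instances on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_f(confusion: list) -> list:
    """One-vs-rest F-measure for each class (rows = true, columns = predicted)."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(confusion[k]) - tp                       # truly k, predicted other
        scores.append(f_measure(tp, fp, fn))
    return scores


# Hypothetical counts for three classes, for illustration only.
confusion = [
    [9, 1, 0],
    [2, 5, 3],
    [1, 2, 7],
]

print(f"accuracy = {accuracy(confusion):.3f}")
for k, score in enumerate(per_class_f(confusion)):
    print(f"class {k}: F-measure = {score:.3f}")
```

With these counts, 21 of 30 instances are classified correctly, while the first class alone has precision 9/12 and recall 9/10, so its F-measure is 9/11.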

Keywords

Author recognition, Stylometry, Artificial neural network, Support vector machine, Decision trees, Authorship features

Ethical Statement

This study does not involve human or animal participants. All procedures followed scientific and ethical principles, and all referenced studies are appropriately cited.

Supporting Institution

This research received no external funding.

Thanks

The authors do not wish to acknowledge any individual or institution.

Details

Primary Language	English
Subjects	Classification Algorithms
Journal Section	Review Article
Authors	Vecdi Emre Levent 0000-0001-6886-8875 Uğur Özbalkan 0000-0003-0440-5390
Submission Date	June 30, 2025
Acceptance Date	November 3, 2025
Publication Date	January 21, 2026
Published in Issue	Year 2026 Volume: 14 Issue: 1

Cite

APA	Levent, V. E., & Özbalkan, U. (2026). Stylometric Profiling of Turkish Texts: Joint Estimation of Author, Region, Age and Genre. Duzce University Journal of Science and Technology, 14(1), 288-298. https://doi.org/10.29130/dubited.1728460
