Stylometric Profiling of Turkish Texts: Joint Estimation of Author, Region, Age and Genre
Year 2026, Volume: 14 Issue: 1, 288 - 298, 21.01.2026
Vecdi Emre Levent, Uğur Özbalkan
Abstract
Every author has distinctive writing (style) characteristics. These characteristics may vary by region, age, content type, and similar factors. In some cases, it is important to determine who wrote a text, which age range the author falls into, or which region has influenced the author's writing. Authorship identification is a method widely used, particularly in plagiarism detection applications, to determine whether an author's writing deviates from its expected characteristics.
In this study, the authorship characteristics of newspaper columnists were evaluated using Artificial Neural Networks, Support Vector Machines, and decision tree algorithms (J48 and Random Forest). Sixteen different features were used as authorship indicators. Six different datasets were created for the experiments, and the results were evaluated. The accuracy obtained was 73% for author identification by region, 55% by news genre, and 62.5% by age. The developed application saves the feature sets generated for different numbers of classes in the ARFF file format and performs author identification using Artificial Neural Networks.
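As an illustration of the ARFF export step, the following is a minimal sketch of how numeric feature vectors with a nominal class attribute could be serialized into ARFF text. The function name `to_arff`, the relation name, the two feature names, and the region labels are all hypothetical; the abstract does not list the sixteen features or describe the application's actual export code.

```python
from typing import Sequence, Tuple


def to_arff(relation: str,
            feature_names: Sequence[str],
            class_labels: Sequence[str],
            rows: Sequence[Tuple[Sequence[float], str]]) -> str:
    """Serialize numeric feature vectors with a nominal class into ARFF text.

    Assumes attribute names and class labels contain no spaces or commas,
    so no quoting is needed.
    """
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per stylometric feature.
    for name in feature_names:
        lines.append(f"@attribute {name} numeric")
    # The class attribute is nominal: its allowed values are listed in braces.
    lines.append("@attribute class {" + ",".join(class_labels) + "}")
    lines += ["", "@data"]
    for features, label in rows:
        if len(features) != len(feature_names):
            raise ValueError("feature vector length does not match header")
        if label not in class_labels:
            raise ValueError(f"unknown class label: {label}")
        values = ",".join(f"{v:g}" for v in features)
        lines.append(f"{values},{label}")
    return "\n".join(lines) + "\n"


# Hypothetical two-feature, two-region example (not data from the study).
example = to_arff(
    "authorship_by_region",
    ["avg_word_length", "avg_sentence_length"],
    ["marmara", "ege"],
    [([5.8, 14.2], "marmara"), ([6.1, 17.5], "ege")],
)
print(example)
```

A file written this way can be opened directly in Weka, which is one plausible reason for choosing ARFF: the same feature sets can then be passed to J48, Random Forest, or other classifiers without re-extracting features.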
Stylometric Profiling of Turkish Texts: Joint Estimation of Author, Region, Age and Genre
Abstract
Authorship identification seeks to determine the writer of a text by analyzing distinctive linguistic and stylistic features. These features may vary with the author’s region and age and with the genre of the text. Identifying an author’s stylistic fingerprint is useful in plagiarism detection, digital forensics, and computational linguistics. In this study, the authorship features of Turkish columnists were analyzed using Artificial Neural Networks (ANN), Support Vector Machines (SVM), and decision tree algorithms (J48 and Random Forest). Sixteen stylometric indicators were selected using the Zemberek natural language processing library and evaluated across six distinct datasets. The proposed system allows flexible parameter adjustment through a graphical interface and exports the feature sets in ARFF format for reproducibility. In the experiments, Random Forest achieved the highest overall accuracy, particularly on the region- and age-based datasets, with F-measures of up to 0.91. The accuracy rates were 73% for regional classification, 55% for genre classification, and 62.5% for age-based classification. The findings indicate that combining statistical learning with stylometric analysis provides a robust framework for Turkish authorship attribution, paving the way for future studies employing deep learning and transformer-based models.
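To make the idea of a stylometric indicator concrete, the sketch below computes four common surface-level features from raw text. These four features, the function names, and the regex-based tokenization are assumptions chosen for illustration; the abstract does not enumerate the sixteen indicators, and the study itself relies on the Zemberek library rather than the plain-Python heuristics used here.

```python
import re
from typing import Dict


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules (İ -> i, I -> ı) applied first."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def stylometric_features(text: str) -> Dict[str, float]:
    """Compute four illustrative surface-level style features."""
    # Rough sentence split on ., ! and ? (a heuristic, not a real segmenter).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ is Unicode-aware in Python 3, so letters such as ç, ğ, ı, ö, ş, ü are kept.
    words = re.findall(r"\w+", turkish_lower(text))
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    # Any character that is neither a word character nor whitespace.
    punctuation = re.findall(r"[^\w\s]", text)
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
        "punctuation_per_word": len(punctuation) / len(words),
    }


# Hypothetical sample sentence pair, not taken from the study's corpus.
print(stylometric_features("Bu bir test. Bu ikinci cümle!"))
```

The Turkish-specific lowercasing matters because the default `str.lower()` maps "I" to "i" rather than "ı", which would merge distinct word forms and distort vocabulary-based features such as the type-token ratio.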
Ethical Statement
This study does not involve human or animal participants. All procedures followed scientific and ethical principles, and all referenced studies are appropriately cited.
Supporting Institution
This research received no external funding.
Thanks
The authors do not wish to acknowledge any individual or institution.