Research Article

Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text

Year 2022, Volume: 25 Issue: 3, 1287 - 1297, 01.10.2022
https://doi.org/10.2339/politeknik.992493

Abstract

The ease of accessing information through the internet and social media, together with expanded opportunities for searching, copying, and spreading data, has made it harder to identify the author of a given text. A text carries the characteristic features of the person who wrote it, and these features can be used to identify its author. In this study, we propose a method for author identification in Turkish texts that combines an ensemble learning algorithm (ELA) with a genetic algorithm (GA). The raw data set, comprising 3269 texts by 40 authors, was collected from Turkish news websites and analyzed in a pre-processing step. Syntactic and structural analyses were then performed on the data, yielding 6 different data sets in total. Each data set underwent feature selection using the GA and ELA approach together. Each resulting data set was then classified with the ELA's bagging method, which contains 5 different classifiers: Naive Bayes, K-Nearest Neighbor, Artificial Neural Networks, Support Vector Machine, and Decision Tree. After these processes were applied to the raw data, the author identification approach reached 89% accuracy. The combination of ELA and GA has strong potential for identifying the author of a text.
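The pairing of GA-based feature selection with a bagging ensemble described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the abstract names five different classifiers inside the bagging step, whereas this sketch substitutes a single nearest-centroid base learner for brevity, and every function name, parameter value, and the synthetic two-"author" data set are hypothetical. Fitness here is validation accuracy of the bagged model on the selected feature subset; a real study would also report accuracy on a separate held-out test set.

```python
import random
from collections import Counter

def centroid_fit(X, y):
    """Nearest-centroid base learner (an assumption, not from the paper)."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def centroid_predict(model, row):
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], row)))

def bagging_fit(X, y, n_estimators, rng):
    """Train each base learner on a bootstrap resample of the training set."""
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(centroid_fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Majority vote over the base learners."""
    votes = Counter(centroid_predict(m, row) for m in models)
    return votes.most_common(1)[0][0]

def project(X, mask):
    """Keep only the feature columns switched on in the boolean mask."""
    return [[v for v, keep in zip(row, mask) if keep] for row in X]

def fitness(mask, X_tr, y_tr, X_va, y_va, seed):
    """Validation accuracy of a bagged model trained on the masked features."""
    if not any(mask):
        return 0.0
    models = bagging_fit(project(X_tr, mask), y_tr, 5, random.Random(seed))
    hits = sum(bagging_predict(models, r) == t
               for r, t in zip(project(X_va, mask), y_va))
    return hits / len(y_va)

def ga_select(X_tr, y_tr, X_va, y_va, pop_size=12, generations=15, p_mut=0.1, seed=0):
    """Genetic search over feature masks; all settings are illustrative."""
    rng = random.Random(seed)
    n = len(X_tr[0])
    score = lambda m: fitness(m, X_tr, y_tr, X_va, y_va, seed)
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([(not g) if rng.random() < p_mut else g for g in child])
        pop = parents + children
    best = max(pop, key=score)
    return best, score(best)

def make_toy(n_per_class, rng):
    """Synthetic data: 2 informative features followed by 4 pure-noise features."""
    X, y = [], []
    for label, centre in (("author_a", 0.0), ("author_b", 3.0)):
        for _ in range(n_per_class):
            informative = [rng.gauss(centre, 1.0) for _ in range(2)]
            noise = [rng.gauss(0.0, 5.0) for _ in range(4)]
            X.append(informative + noise)
            y.append(label)
    return X, y

data_rng = random.Random(42)
X_tr, y_tr = make_toy(40, data_rng)
X_va, y_va = make_toy(40, data_rng)
best_mask, best_acc = ga_select(X_tr, y_tr, X_va, y_va)
print("selected features:", best_mask, "validation accuracy:", best_acc)
```

Because the fitness function reseeds its bootstrap sampler on every call, each mask always receives the same score, which keeps the selection step reproducible. Swapping `centroid_fit` and `centroid_predict` for a set of heterogeneous classifiers would bring the sketch closer to the five-classifier setup the abstract describes.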

References

  • [1] T. Neal, K. Sundararajan, A. Fatima, Y. Yan, Y. Xiang, and D. Woodard, “Surveying Stylometry Techniques and Applications,” ACM Comput. Surv., 50(6): 1–36, (2018).
  • [2] S. E. De Morgan and A. De Morgan, “Memoir of Augustus de Morgan by his wife Sophia Elizabeth de Morgan with selections from his letters,” London: Longmans, Green, and Co., (1882).
  • [3] T. C. Mendenhall, “The Characteristic Curves of Composition,” Science, 9(214): 237–249, (1887).
  • [4] G. U. Yule, “The statistical study of literary vocabulary,” Cambridge Univ. Press, (1944).
  • [5] F. Mosteller and D. L. Wallace, “Inference and disputed authorship: the federalist papers,” Addison-Wesley, Reading, Mass, (1964).
  • [6] R. Sarwar, T. Porthaveepong, A. Rutherford, T. Rakthanmanon, and S. Nutanong, “StyloThai: A scalable framework for stylometric authorship identification of Thai documents,” ACM Trans. Asian Low-Resource Lang. Inf. Process., 19(3), (2020).
  • [7] A. F. Otoom, E. E. Abdullah, S. Jaafer, A. Hamdallh, and D. Amer, “Towards author identification of Arabic text articles,” in 2014 5th International Conference on Information and Communication Systems (ICICS), 1–4, (2014).
  • [8] S. Ouamour and H. Sayoud, “Authorship Attribution of Short Historical Arabic Texts Based on Lexical Features,” in 2013 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 144–147, (2013).
  • [9] D. L. Hoover, “Statistical Stylistics and Authorship Attribution: an Empirical Investigation,” Lit. Linguist. Comput., 16(4): 421–444, (2001).
  • [10] H. Sayoud, “Author discrimination between the holy Quran and Prophet’s statements,” Lit. Linguist. Comput., 27(4): 427–444, (2012).
  • [11] J. Diederich, J. Kindermann, E. Leopold, and G. Paass, “Authorship attribution with support vector machines,” Appl. Intell., 19(1): 109–123, (2003).
  • [12] M. Koppel, D. Mughaz, and N. Akiva, “New methods for attribution of Rabbinic literature,” Hebrew Linguistics: A Journal for Hebrew Descriptive, Computational and Applied Linguistics, 57: 5–18, (2006).
  • [13] R. Zheng, J. Li, H. Chen, and Z. Huang, “A framework for authorship identification of online messages: Writing-style features and classification techniques,” J. Am. Soc. Inf. Sci. Technol., 57(3): 378–393, (2006).
  • [14] V. Keselj, F. Peng, N. Cercone, and C. Thomas, “N-gram-based author profiles for authorship attribution,” Proc. Pacific Assoc. Comput. Linguist., 255–264, (2003).
  • [15] O. V. Kukushkina, A. A. Polikarpov, and D. V. Khmelev, “Using Literal and Grammatical Statistics for Authorship Attribution,” Probl. Inf. Transm., 37(2): 172–184, (2001).
  • [16] P. Juola, “A Controlled-corpus Experiment in Authorship Identification by Cross-entropy,” Lit. Linguist. Comput., 20(1): 59–67, (2005).
  • [17] J. Savoy, “Comparative evaluation of term selection functions for authorship attribution,” Digit. Scholarsh. Humanit., 30(2): 246–261, (2015).
  • [18] E. Ekinci and H. Takci, “Using authorship analysis techniques in forensic analysis of electronic mails,” in 2012 20th Signal Processing and Communications Applications Conference (SIU), 1–4, (2012).
  • [19] H. V. Agun, S. Yilmazel, and O. Yilmazel, “Effects of language processing in Turkish authorship attribution,” in 2017 IEEE International Conference on Big Data (Big Data), 1876–1881, (2017).
  • [20] E. Aydemir, “Türkçe Köşe Yazılarında Yapay Sinir Ağlarıyla Yazar ve Gazete Tahmin Etme,” DÜMF Mühendislik Derg., 10(1): 45–56, (2019).
  • [21] F. Türkoğlu, B. Diri, and M. F. Amasyalı, “Author Attribution of Turkish Texts by Feature Mining,” in Advanced Intelligent Computing Theories and Applications. With Aspects of Theoretical and Methodological Issues, Berlin, Heidelberg: Springer Berlin Heidelberg, 1086–1093, (2007).
  • [22] Y. Aktaş, E. Y. İnce, and A. Çakir, “Doğal Dil İşleme Kullanarak Bilgisayar Ağ Terimlerinin Wordnet Ontolojisinde Uyarlanması / Wordnet Ontology Based Creation Of Computer Network Terms By Using Natural Language Processing,” (2017).
  • [23] M. Zhou, N. Duan, S. Liu, and H.-Y. Shum, “Progress in Neural NLP: Modeling, Learning, and Reasoning,” Engineering, 6(3): 275–290, (2020).
  • [24] H. Polat and M. Körpe, “TBMM Genel Kurul Tutanaklarından Yakın Anlamlı Kavramların Çıkarılması,” Bilişim Teknol. Derg., 11(3), (2018).
  • [25] N. Doğan, “İstem Sözlükleri ve Türkçe,” J. Acad. Soc. Sci. Stud., 1(42): 251, (2016).
  • [26] O. Coban and I. Karabey, “Music genre classification with word and document vectors,” in 2017 25th Signal Processing and Communications Applications Conference (SIU), 1–4, (2017).
  • [27] E. Yıldırım, F. Çetin, E. G., and T. T., “The Impact of NLP on Turkish Sentiment Analysis,” Türkiye Bilişim Vakfı Bilgi. Bilim. ve Mühendislik Dergisi, 43–51, (2015).
  • [28] A. S. Yüksel and F. G. Tan, “Metin Madenciliği Teknikleri ile Sosyal Ağlarda Bilgi Keşfi,” Mühendislik Bilim. ve Tasarım Derg., 6(2): 324–333, (2018).
  • [29] A. G. Vural, B. B. Cambazoglu, P. Senkul, and Z. O. Tokgoz, “A Framework for Sentiment Analysis in Turkish: Application to Polarity Detection of Movie Reviews in Turkish,” in Computer and Information Sciences III, London: Springer London, 437–445, (2013).
  • [30] C. Bechikh Ali, H. Haddad, and Y. Slimani, “Empirical evaluation of compounds indexing for Turkish texts,” Comput. Speech Lang., 56: 95–106, (2019).
  • [31] A. A. Akın and M. D. Akın, “Zemberek, an open source NLP framework for Turkic Languages,” Structure, 10: 1–5, (2007).
  • [32] E. Loper and S. Bird, “NLTK: the Natural Language Toolkit,” in Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics -, 1: 63–70, (2002).
  • [33] N. An, H. Ding, J. Yang, R. Au, and T. F. A. Ang, “Deep ensemble learning for Alzheimer’s disease classification,” J. Biomed. Inform., 105: 103411, (2020).
  • [34] Y. Zhu, W. XU, G. Luo, H. Wang, J. Yang, and W. Lu, “Random Forest enhancement using improved Artificial Fish Swarm for the medial knee contact force prediction,” Artif. Intell. Med., 103: 101811, (2020).
  • [35] L. Breiman, “Bagging predictors,” Mach. Learn., 24(2): 123–140, (1996).
  • [36] S. Agarwal and C. R. Chowdary, “A-Stacking and A-Bagging: Adaptive versions of ensemble learning algorithms for spoof fingerprint detection,” Expert Syst. Appl., 146: 113160, (2020).
  • [37] J. H. Holland, “Genetic algorithms,” Sci. Am., 267(1): 66–73, (1992).
  • [38] J. Yang and V. Honavar, “Feature subset selection using a genetic algorithm,” IEEE Intell. Syst., 13(2): 44–49, (1998).
  • [39] G. L. Pappa, A. A. Freitas, and C. A. A. Kaestner, “Attribute Selection with a Multi-objective Genetic Algorithm,” 280–290, (2002).
  • [40] T. Taş and A. K. Görür, “Author Identification for Turkish Texts,” Çankaya Üniversitesi Fen-Edebiyat Fakültesi, J. Arts Sci., 7: 151–161, (2007).
  • [41] S. Doğan and B. Diri, “Türkçe Dokümanlar İçin N-gram Tabanlı Yeni Bir Sınıflandırma (Ng-ind): Yazar, Tür ve Cinsiyet,” Türkiye Bilişim Vakfı Bilgi. Bilim. ve Mühendisliği Derg., 1(3): 11–19, (2010).
  • [42] T. Uyar, K. Karacan Uyar, and E. Yağlı, “Gözetimli Makine Öğrenmesiyle Noktalama ve Etkisiz Kelime Sıklıkları Kullanarak Yazar Tanıma,” Bilişim Teknol. Derg., 14(2): 183–190, (2021).

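Türkçe Metinde Topluluk Öğrenme ve Genetik Algoritma Kombinasyonu Tabanlı Yazar Tahmini (Author Prediction Based on a Combination of Ensemble Learning and Genetic Algorithm in Turkish Text)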

Abstract

Easier access to information through the internet and social media, along with wide opportunities for searching, copying, and spreading data, has caused problems in determining the author of a given text. A text carries the characteristic features of the person who wrote it, and these features can be used to determine its author. In this study, we present a method for author detection in Turkish texts based on an approach that uses an ensemble learning algorithm (ELA) and a genetic algorithm (GA). The raw data set, consisting of 40 authors and 3269 texts, was compiled from Turkish news sites and analyzed in the pre-processing stage. Syntactic and structural analyses were then carried out on the data, and 6 different data sets were created in total. Each data set was subjected to a feature selection process using the GA and ELA approach together. Each data set obtained from the previous step was classified using the ELA's bagging method, which contains 5 different classifiers: Naive Bayes, K-Nearest Neighbor, Artificial Neural Networks, Support Vector Machine, and Decision Tree. After these operations were applied to the raw data, the author identification approach reached 89% accuracy. The combination of ELA and GA has strong potential for determining the author of a text.

There are 42 citations in total.

Details

Primary Language: English
Subjects: Engineering
Journal Section: Research Article
Authors:

Merve Güllü (ORCID: 0000-0001-7442-1332)

Hüseyin Polat (ORCID: 0000-0003-4128-2625)

Publication Date: October 1, 2022
Submission Date: September 7, 2021
Published in Issue: Year 2022, Volume: 25, Issue: 3

Cite

APA Güllü, M., & Polat, H. (2022). Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi, 25(3), 1287-1297. https://doi.org/10.2339/politeknik.992493
AMA Güllü M, Polat H. Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi. October 2022;25(3):1287-1297. doi:10.2339/politeknik.992493
Chicago Güllü, Merve, and Hüseyin Polat. “Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text”. Politeknik Dergisi 25, no. 3 (October 2022): 1287-97. https://doi.org/10.2339/politeknik.992493.
EndNote Güllü M, Polat H (October 1, 2022) Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi 25 3 1287–1297.
IEEE M. Güllü and H. Polat, “Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text”, Politeknik Dergisi, vol. 25, no. 3, pp. 1287–1297, 2022, doi: 10.2339/politeknik.992493.
ISNAD Güllü, Merve - Polat, Hüseyin. “Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text”. Politeknik Dergisi 25/3 (October 2022), 1287-1297. https://doi.org/10.2339/politeknik.992493.
JAMA Güllü M, Polat H. Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi. 2022;25:1287–1297.
MLA Güllü, Merve and Hüseyin Polat. “Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text”. Politeknik Dergisi, vol. 25, no. 3, 2022, pp. 1287-97, doi:10.2339/politeknik.992493.
Vancouver Güllü M, Polat H. Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi. 2022;25(3):1287-97.