THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION
Abstract
Author identification, a popular topic in text classification and natural language processing, aims to determine the author of a given text through analysis of that text. The literature on this problem considers a variety of text representation approaches and preprocessing steps. This paper comprehensively examines the impact of text representation and preprocessing on author identification, specifically for the Turkish language. To this end, it investigates how every combination of the text representation approaches (unigram and bigram) and the preprocessing tasks (stemming and stop-word removal) contributes to author identification performance. A new dataset was constructed for the experimental evaluation, and two classification algorithms, Multinomial Naive Bayes and Sequential Minimal Optimization, were employed. The experimental results reveal that using bigram features alone should be avoided. They also show that stop-words should be kept in the text, whereas the benefit of stemming depends on the classification algorithm.
Keywords
Author identification, text classification, text preprocessing, text representation
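The combinations of representation and preprocessing described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: the three representations are taken to be unigram only, bigram only, and both together; the stop-word list is a three-word illustration; `toy_stem` is a hypothetical placeholder that truncates each token rather than performing real Turkish stemming; and stop-words are removed before stemming. The function and variable names are likewise invented for this example.

```python
from itertools import product

# Illustrative stop-word list only; NOT the list used in the paper.
STOP_WORDS = {"ve", "bir", "bu"}

# Assumed representation settings: unigram only, bigram only, or both.
REPRESENTATIONS = [("unigram",), ("bigram",), ("unigram", "bigram")]


def toy_stem(token, prefix_len=5):
    """Hypothetical placeholder stemmer: keep the first prefix_len characters."""
    return token[:prefix_len]


def extract_features(text, ngrams, stem, remove_stop_words):
    """Return the list of features for one text under one configuration."""
    tokens = text.split()
    # Assumption: stop-words are removed before stemming.
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    features = []
    if "unigram" in ngrams:
        features.extend(tokens)
    if "bigram" in ngrams:
        # Adjacent token pairs, joined with a space.
        features.extend(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features


def all_configurations():
    """Enumerate every representation / stemming / stop-word combination."""
    return [
        {"ngrams": n, "stem": s, "remove_stop_words": r}
        for n, s, r in product(REPRESENTATIONS, [False, True], [False, True])
    ]


if __name__ == "__main__":
    sample = "bu kitap ve kalem"
    for config in all_configurations():
        print(config, extract_features(sample, **config))
```

Under these assumptions the grid contains 3 × 2 × 2 = 12 configurations. In an experiment, each resulting feature list would be turned into a feature vector and passed to a classifier; that step is omitted here.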