THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION

Muhammet Yasin Pak; Serkan Gunal

doi:10.18038/aubtda.270276

Abstract

Author identification, a popular topic in text classification and natural language processing, aims to determine the author of a given text through various analyses. The literature considers different text representation approaches and preprocessing steps for the author identification problem. This paper comprehensively examines the impact of text representation and preprocessing on author identification, specifically for the Turkish language. For this purpose, it investigates how all possible combinations of text representation approaches, namely unigram and bigram, and preprocessing tasks, including stemming and stop-word removal, contribute to author identification performance. A new dataset was constructed for the experimental evaluation, and two classification algorithms, Multinomial Naive Bayes and Sequential Minimal Optimization, were employed. The experimental results reveal that using bigram features alone should be avoided. They also show that, to achieve higher author identification performance, stop-words should be kept in the text, whereas stemming may be preferable depending on the classification algorithm.
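
The experimental design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The stop-word list, the prefix-truncation `toy_stem` function, and all function and parameter names are hypothetical stand-ins chosen for illustration, and the assumption that "all possible combinations" means three representations (unigram, bigram, and both together) crossed with stemming on/off and stop-word removal on/off is ours.

```python
from itertools import product

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"ve", "bir", "bu", "da", "de"}


def toy_stem(token, prefix_len=5):
    """Crude stand-in for a real Turkish stemmer: keep the first few characters."""
    return token[:prefix_len]


def extract_features(text, use_unigram=True, use_bigram=False,
                     stem=False, remove_stop_words=False):
    """Turn a text into a list of unigram and/or bigram features."""
    # Note: str.lower() is not Turkish-aware (dotted/dotless i); fine for a sketch.
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    features = []
    if use_unigram:
        features.extend(tokens)
    if use_bigram:
        # Adjacent token pairs, joined with an underscore.
        features.extend(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return features


def configurations():
    """Enumerate every (label, keyword-arguments) pair in the assumed design."""
    representations = [
        ("unigram", True, False),
        ("bigram", False, True),
        ("unigram+bigram", True, True),
    ]
    configs = []
    for (name, uni, bi), stem, rm in product(representations,
                                             [False, True], [False, True]):
        label = f"{name}, stem={stem}, remove_stop_words={rm}"
        kwargs = {"use_unigram": uni, "use_bigram": bi,
                  "stem": stem, "remove_stop_words": rm}
        configs.append((label, kwargs))
    return configs


if __name__ == "__main__":
    sample = "bu kitaplar ve kalemler"
    for label, kwargs in configurations():
        print(label, "->", extract_features(sample, **kwargs))
```

Each configuration would produce a different feature vocabulary for the same corpus. In an actual study, these features would then be fed to the classifiers named in the abstract.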

Keywords

Author identification, text classification, text preprocessing, text representation

Details

Primary Language

English

Subjects

Engineering

Journal Section

Research Article

Authors

Muhammet Yasin Pak
Department of Computer Engineering, Anadolu University
Türkiye

Serkan Gunal
Department of Computer Engineering, Anadolu University
Türkiye

Publication Date

March 31, 2017

Submission Date

November 28, 2016

Acceptance Date

-

Published in Issue

Year 2017 Volume: 18 Number: 1

DOI

https://doi.org/10.18038/aubtda.270276

IZ

https://izlik.org/JA53MN99FF

APA

Pak, M. Y., & Gunal, S. (2017). THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering, 18(1), 218-224. https://doi.org/10.18038/aubtda.270276
