A Comparison of Different Approaches to Document Representation in Turkish Language

Volume: 22 Number: 2 August 15, 2018
  • Savaş Yıldırım
  • Tuğba Yıldız

Abstract

Recently, deep learning methods have demonstrated state-of-the-art performance on numerous complex Natural Language Processing (NLP) problems. Easy access to high-performance computing resources and open-source libraries has made Artificial Intelligence (AI) approaches more accessible to researchers. This rapid growth in available techniques has shaped and raised standards in the field of NLP. The variety of open-source libraries and the large body of existing research therefore give us an opportunity to compare different approaches to document representation. We evaluate four paradigms for representing documents: traditional bag-of-words approaches, topic modeling, embedding-based approaches, and deep learning. As the main contribution of this article, we evaluate all of these representation approaches with suitable machine learning algorithms on the document categorization problem in the Turkish language. The supervised architecture uses a benchmark dataset prepared specifically for this language. Within the architecture, we evaluate the representation approaches with corresponding machine learning algorithms such as Support Vector Machine (SVM) and multinomial Naive Bayes (m-NB). We conduct a variety of experiments and present successful results for Turkish document categorization. We also observe that traditional approaches still achieve results comparable to neural network models in document classification.
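To make the bag-of-words paradigm concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over raw term counts, written in plain Python. It is not the authors' implementation: the toy documents, the labels `spor` and `ekonomi`, the whitespace tokenizer, and the smoothing value `alpha=1.0` are all assumptions chosen for illustration, and the benchmark dataset used in the article is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (an assumption). Turkish-specific casing
    # (I/ı, İ/i) and morphological analysis are deliberately not handled.
    return text.lower().split()


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> term frequencies
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_doc_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        # Denominator: total tokens in the class plus alpha per vocabulary term.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest posterior log score."""
    # Bag-of-words representation: term counts, ignoring out-of-vocabulary terms.
    counts = Counter(t for t in tokenize(text) if t in model["vocab"])
    scores = {}
    for label, prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        scores[label] = prior + sum(c * likelihood[w] for w, c in counts.items())
    return max(scores, key=scores.get)


# Hypothetical toy corpus: two sports and two economy snippets.
docs = [
    "takım maç gol attı",
    "maç hakem gol penaltı",
    "borsa dolar faiz arttı",
    "faiz enflasyon dolar piyasa",
]
labels = ["spor", "spor", "ekonomi", "ekonomi"]
model = train_multinomial_nb(docs, labels)
```

Calling `predict(model, "gol maç takım")` scores each class by summing the log prior and the count-weighted log likelihoods of the known terms. In practice, a library implementation with TF-IDF weighting and proper Turkish preprocessing would likely replace this sketch.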

Details

Primary Language

Turkish

Authors

Savaş Yıldırım

Tuğba Yıldız

Publication Date

August 15, 2018

Submission Date

December 28, 2017

Published in Issue

Year 2018 Volume: 22 Number: 2

APA
Yıldırım, S., & Yıldız, T. (2018). A Comparison of Different Approaches to Document Representation in Turkish Language. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 22(2), 569-576. https://izlik.org/JA29AX27TY
AMA
1. Yıldırım S, Yıldız T. A Comparison of Different Approaches to Document Representation in Turkish Language. J. Nat. Appl. Sci. 2018;22(2):569-576. https://izlik.org/JA29AX27TY
Chicago
Yıldırım, Savaş, and Tuğba Yıldız. 2018. “A Comparison of Different Approaches to Document Representation in Turkish Language”. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22 (2): 569-76. https://izlik.org/JA29AX27TY.
EndNote
Yıldırım S, Yıldız T (August 1, 2018) A Comparison of Different Approaches to Document Representation in Turkish Language. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22 2 569–576.
IEEE
[1] S. Yıldırım and T. Yıldız, “A Comparison of Different Approaches to Document Representation in Turkish Language”, J. Nat. Appl. Sci., vol. 22, no. 2, pp. 569–576, Aug. 2018, [Online]. Available: https://izlik.org/JA29AX27TY
ISNAD
Yıldırım, Savaş - Yıldız, Tuğba. “A Comparison of Different Approaches to Document Representation in Turkish Language”. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22/2 (August 1, 2018): 569-576. https://izlik.org/JA29AX27TY.
JAMA
1. Yıldırım S, Yıldız T. A Comparison of Different Approaches to Document Representation in Turkish Language. J. Nat. Appl. Sci. 2018;22:569–576.
MLA
Yıldırım, Savaş, and Tuğba Yıldız. “A Comparison of Different Approaches to Document Representation in Turkish Language”. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, vol. 22, no. 2, Aug. 2018, pp. 569-76, https://izlik.org/JA29AX27TY.
Vancouver
1. Savaş Yıldırım, Tuğba Yıldız. A Comparison of Different Approaches to Document Representation in Turkish Language. J. Nat. Appl. Sci. [Internet]. 2018 Aug. 1;22(2):569-76. Available from: https://izlik.org/JA29AX27TY

e-ISSN: 1308-6529
Linking ISSN (ISSN-L): 1300-7688

All published articles in the journal can be accessed free of charge and are open access under the Creative Commons CC BY-NC (Attribution-NonCommercial) license. All authors and other journal users are deemed to have accepted this situation.