A Comparison of Different Approaches to Document Representation in Turkish Language

Volume: 22 Issue: 2, 15 August 2018
  • Savaş Yıldırım
  • Tuğba Yıldız

Abstract

Recently, deep learning methods have demonstrated state-of-the-art performance in numerous complex Natural Language Processing (NLP) problems. Easy access to high-performance computing resources and open-source libraries makes Artificial Intelligence (AI) approaches more applicable for researchers. This sudden growth of available techniques has shaped and improved standards in the field of NLP. Owing to the variety of open-source libraries and the large amount of research, we find an opportunity to compare different approaches to document representation. We evaluate four different paradigms for representing documents: traditional bag-of-words approaches, topic modeling, embedding-based approaches, and deep learning. As the main contribution of this article, we aim to evaluate all these representation approaches with suitable machine learning algorithms for the document categorization problem in the Turkish language. The supervised architecture uses a benchmark dataset specifically prepared for this language. Within the architecture, we evaluate the representation approaches with corresponding machine learning algorithms such as Support Vector Machine (SVM), multinomial Naive Bayes (m-NB), and so forth. We conduct a variety of experiments and present successful results for Turkish document categorization. We also observe that traditional approaches still achieve results comparable to those of neural network models in document classification.
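To make the first paradigm concrete, the following is a minimal sketch of a bag-of-words representation paired with a multinomial Naive Bayes classifier, written with the Python standard library only. It is not the authors' implementation: the class name `MultinomialNB`, the whitespace tokenizer, the Laplace smoothing constant of 1, and the toy Turkish sentences and labels are all assumptions made for illustration, and the benchmark dataset mentioned in the abstract is not used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or morphological analysis)."""
    return text.lower().split()


class MultinomialNB:
    """Multinomial Naive Bayes over bag-of-words counts with Laplace smoothing."""

    def fit(self, docs, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, doc):
        # Bag-of-words vector of the query, restricted to the known vocabulary.
        counts = Counter(t for t in tokenize(doc) if t in self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token, freq in counts.items():
                score += freq * math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus; sentences and category names are illustrative only.
train_docs = [
    "takım maçı kazandı",
    "futbol maçı berabere bitti",
    "borsa güne yükselişle başladı",
    "dolar ve borsa düştü",
]
train_labels = ["spor", "spor", "ekonomi", "ekonomi"]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict("takım maçı kazandı"))
print(model.predict("borsa düştü"))
```

Word order is discarded by construction, which is the defining property of a bag-of-words representation. For an agglutinative language such as Turkish, a real pipeline would likely need morphological preprocessing before counting tokens, a step this sketch omits.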

Details

Primary Language

Turkish

Authors

Savaş Yıldırım

Tuğba Yıldız

Publication Date

15 August 2018

Submission Date

28 December 2017

Published in Issue

Year 2018 Volume: 22 Issue: 2

Cite

APA
Yıldırım, S., & Yıldız, T. (2018). A Comparison of Different Approaches to Document Representation in Turkish Language. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 22(2), 569-576. https://izlik.org/JA29AX27TY

e-ISSN: 1308-6529
Linking ISSN (ISSN-L): 1300-7688

All articles published in the journal are freely accessible and are offered as open access under the Creative Commons CC BY-NC (Attribution-NonCommercial) license. All authors and other users of the journal are deemed to have accepted this condition.