A Comparison of Different Approaches to Document Representation in Turkish Language

Year 2018, Volume 22, Issue 2, 569 - 576, 15.08.2018

Abstract

Recently, deep learning methods have demonstrated state-of-the-art performance on numerous complex Natural Language Processing (NLP) problems. Easy access to high-performance computing resources and open-source libraries has made Artificial Intelligence (AI) approaches more applicable for researchers. This rapid growth in available techniques has shaped and raised standards in the field of NLP. Given the variety of open-source libraries and the large body of existing research, we take the opportunity to compare different approaches to document representation. We evaluate four paradigms for representing documents: traditional bag-of-words approaches, topic modeling, embedding-based approaches, and deep learning. As the main contribution of this article, we evaluate all of these representation approaches with suitable machine learning algorithms on the document categorization problem in the Turkish language. The supervised architecture uses a benchmark dataset prepared specifically for this language. Within the architecture, we pair each representation approach with corresponding machine learning algorithms such as Support Vector Machine (SVM) and multinomial Naive Bayes (m-NB), among others. We conduct a variety of experiments and present successful results for Turkish document categorization. We also observe that traditional approaches still achieve results comparable to those of Neural Network models in document classification.
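
To make the first of these pairings concrete, the following is a minimal sketch of a bag-of-words representation combined with multinomial Naive Bayes, written with the Python standard library only. It is not the authors' implementation: the function names (`tokenize`, `train_multinomial_nb`, `predict`), the whitespace tokenizer, the smoothing value `alpha=1.0`, and the four toy Turkish sentences with their labels are all assumptions made for illustration and do not come from the benchmark dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumption: no stemming or
    Turkish-specific normalization, which a real pipeline would likely add)."""
    return text.lower().split()


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on bag-of-words counts with add-alpha smoothing."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class label -> word -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_doc_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        # Smoothed denominator: total tokens in the class plus alpha per vocabulary word.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + alpha) / denom)
            for word in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        log_likelihood = model["log_likelihood"][label]
        scores[label] = log_prior + sum(
            log_likelihood[token] for token in tokenize(text) if token in log_likelihood
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus (not taken from the benchmark dataset).
train_docs = [
    "takım maçı kazandı gol attı",
    "futbol ligi maç sonucu",
    "borsa dolar faiz yükseldi",
    "merkez bankası faiz kararı",
]
train_labels = ["spor", "spor", "ekonomi", "ekonomi"]

model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, "maç gol"))
print(predict(model, "dolar faiz"))
```

The other paradigms named in the abstract would replace the raw word counts with topic proportions, embedding vectors, or learned neural representations, while the classifier on top can stay comparable.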

Details

Journal Section Articles
Authors

Savaş YILDIRIM
Tuğba YILDIZ

Publication Date August 15, 2018
Published in Issue Year 2018, Volume 22, Issue 2

Cite

BibTeX @article{sdufenbed456349, journal = {Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi}, eissn = {1308-6529}, publisher = {Süleyman Demirel University}, year = {2018}, volume = {22}, pages = {569 - 576}, title = {A Comparison of Different Approaches to Document Representation in Turkish Language}, key = {cite}, author = {Yıldırım, Savaş and Yıldız, Tuğba} }
APA Yıldırım, S. & Yıldız, T. (2018). A Comparison of Different Approaches to Document Representation in Turkish Language. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 22(2), 569-576. Retrieved from https://dergipark.org.tr/en/pub/sdufenbed/issue/38975/456349
MLA Yıldırım, S., Yıldız, T. "A Comparison of Different Approaches to Document Representation in Turkish Language". Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22 (2018): 569-576 <https://dergipark.org.tr/en/pub/sdufenbed/issue/38975/456349>
Chicago Yıldırım, S., Yıldız, T. "A Comparison of Different Approaches to Document Representation in Turkish Language". Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22 (2018): 569-576
RIS TY - JOUR T1 - A Comparison of Different Approaches to Document Representation in Turkish Language AU - Savaş Yıldırım , Tuğba Yıldız Y1 - 2018 PY - 2018 N1 - DO - T2 - Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi JF - Journal JO - JOR SP - 569 EP - 576 VL - 22 IS - 2 SN - -1308-6529 M3 - UR - Y2 - 2021 ER -
EndNote %0 Süleyman Demirel University Journal of Natural and Applied Sciences A Comparison of Different Approaches to Document Representation in Turkish Language %A Savaş Yıldırım , Tuğba Yıldız %T A Comparison of Different Approaches to Document Representation in Turkish Language %D 2018 %J Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi %P -1308-6529 %V 22 %N 2 %R %U
ISNAD Yıldırım, Savaş, Yıldız, Tuğba. "A Comparison of Different Approaches to Document Representation in Turkish Language". Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22 / 2 (August 2018): 569-576.
AMA Yıldırım S., Yıldız T. A Comparison of Different Approaches to Document Representation in Turkish Language. SDÜ Fen Bil Enst Der. 2018; 22(2): 569-576.
Vancouver Yıldırım S., Yıldız T. A Comparison of Different Approaches to Document Representation in Turkish Language. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi. 2018; 22(2): 569-576.
IEEE S. Yıldırım and T. Yıldız, "A Comparison of Different Approaches to Document Representation in Turkish Language", Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, vol. 22, no. 2, pp. 569-576, Aug. 2018

e-ISSN: 1308-6529