A comparative analysis of text classification for Turkish

Savaş Yıldırım; Tuğba Yıldız

Research Article

Year 2018, Volume: 24 Issue: 5, 879 - 886, 12.10.2018

Abstract

Text classification holds an important place in the field of Natural Language
Processing (NLP). The recent growth in textual data and the need to label it
automatically have increased the importance of the text classification problem.
The bag-of-words method, prominent among traditional approaches, has been
successful in text classification for years. Recently, neural network language
models have been applied successfully to NLP problems and have achieved great
success in some areas. The most important advantage of architectures based on
Artificial Neural Networks (ANN) is that they produce more effective word and
text representations. These representations have been found to be
lower-dimensional and more effective than those of traditional methods, and
they have been applied successfully in semantic and syntactic analysis in
particular. On the other hand, traditional bag-of-words methods, which use
longer vector representations, still hold their ground as text
representations. However, no comparison of these two approaches has been made
for Turkish. In this study, the traditional bag-of-words approach and the new
neural-network-based representation approaches are compared with respect to
text classification. We observed that traditional methods with effective
feature selection can still compete with the new-generation word embedding
approach. Finally, we report experiments that vary both approaches and discuss
in detail a successful text classification architecture for Turkish.
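
The contrast between the two kinds of representation can be illustrated with a minimal sketch. The toy corpus, the 2-dimensional embedding values, and the helper names `bow_vector` and `mean_embedding` are all hypothetical, and averaging word vectors is only one simple way to build a document vector; it is not necessarily the method used in the study.

```python
from collections import Counter

# Toy corpus; documents and embedding values are invented for illustration.
docs = ["futbol lig gol", "lig gol gol", "banka kredi faiz"]

# Bag-of-words: one dimension per distinct word in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})


def bow_vector(text):
    """Return term counts over the corpus vocabulary."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]


# Hypothetical 2-dimensional embeddings; real ones are learned from large
# corpora and usually have far more dimensions than this.
embeddings = {
    "futbol": [0.9, 0.1], "lig": [0.8, 0.2], "gol": [0.7, 0.1],
    "banka": [0.1, 0.9], "kredi": [0.2, 0.8], "faiz": [0.1, 0.7],
}


def mean_embedding(text):
    """Average the embeddings of the known words in a text."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


bow = bow_vector(docs[1])        # length grows with the vocabulary
dense = mean_embedding(docs[1])  # length is fixed by the embedding size
print(len(vocab), bow)
print(len(dense), dense)
```

The bag-of-words vector has as many entries as the vocabulary has words, so it grows with the corpus, while the averaged embedding keeps a fixed, small size regardless of vocabulary.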

Keywords

Text classification, Machine learning, Artificial neural networks

References

Salton G, Wong A, Yang CS. “A vector space model for automatic indexing”. Communications of the ACM, 18(11), 613-620, 1975.
Harris, Z. “Distributional structure”. Word, 10(2), 146-162, 1954.
Mikolov T, Chen K, Corrado G, Dean J. “Efficient estimation of word representations in vector space”. Proceedings of Workshop at ICLR, Scottsdale, Arizona, 2-4 May 2013.
Pennington J, Socher R, Manning C. “Glove: Global vectors for word representation”. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25-29 October 2014.
Le Q, Mikolov T. “Distributed representations of sentences and documents”. 31st International Conference on Machine Learning, Beijing, China, 21-26 June 2014.
Amasyalı MF, Diri B. Automatic Turkish text categorization in terms of author, genre and gender. Natural Language Processing and Information Systems, Lecture Notes in Computer Science, Vol 3999, 221-226, Berlin, Heidelberg, Germany, Springer, 2006.
Türkoğlu F, Diri B, Amasyalı MF. Author Attribution of Turkish Texts by Feature Mining. International Conference on Intelligent Computing, Lecture Notes in Computer Science, vol 4681. Springer, Berlin, Heidelberg, 2007.
Amasyalı MF, Balcı S, Mete E, Varlı EN. “Türkçe metinlerin sınıflandırılmasında metin temsil yöntemlerinin performans karşılaştırılması”. EMO Bilimsel Dergi, 2(4), 2012.
Kılınç D, Özçift A, Bozyigit F, Yıldırım P, Yücalar F, Borandag E. “TTC-3600: A new benchmark dataset for Turkish text categorization”. Journal of Information Science, 43(2), 174-185, 2015.
Tüfekçi P, Uzun E, Sevinç B. “Text classification of web based news articles by using Turkish grammatical features”. 20th Signal Processing and Communications Applications Conference (SIU), Muğla, Turkey, 18-20 April 2012.
Akkuş BK, Çakıcı R. “Categorization of Turkish news documents with morphological analysis”. 51st Annual Meeting of the ACL Proceedings of the Student Research Workshop, Sofia, Bulgaria, 5-7 August 2013.
Torunoğlu D, Çakırman E, Ganiz MC, Akyokuş S, Gürbüz MZ. “Analysis of preprocessing methods on classification of Turkish texts”. International Symposium on Innovations in Intelligent Systems and Applications (INISTA), Istanbul, Turkey, 15-18 June 2011.
Uysal AK, Günal S. “The impact of preprocessing on text classification”. Information Processing and Management, 50(1), 104-112, 2014.
Yıldırım S. “A knowledge-poor approach to Turkish text categorization”. 15th International Conference on Computational Linguistics and Intelligent Text Processing, Kathmandu, Nepal, 6-12 April 2014.
Açıkalın B, Bayazıt NG. “The importance of preprocessing in Turkish Text classification”. 24th Signal Processing and Communication Application Conference, Zonguldak, Turkey, 16-19 May 2016.
Amasyalı MF, Beken A. “Türkçe kelimelerin anlamsal benzerliklerinin ölçülmesi ve metin sınıflandırmada kullanılması”. Signal Processing and Communication Application Conference, Antalya, Turkey, 9-11 April 2009.
Toraman Ç. Text Categorization and Ensemble Pruning in Turkish News Portals. PhD Thesis, Bilkent University, Ankara, Turkey, 2011.
Schütze H, Hull DA, Pedersen JO. “A comparison of classifiers and document representations for the routing problem”. 18th ACM Conference on Research and Development in Information Retrieval, New York, USA, 9-13 July 1995.
Lewis D, Ringuette M. “A comparison of two learning algorithms for text categorization”. 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, USA, 11-13 April 1994.
Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval, 1st Edition, New York, USA, Cambridge University Press, 2008.
Zhang W, Yoshida T, Tang X. “A comparative study of TF*IDF, LSI and multi-words for text classification”. Expert Systems with Applications, 38(3), 2758-2765, 2011.
Akin A, Akin MD. “Zemberek, an open source NLP framework for Turkic Languages”. Structure, 10, 1-5, 2007.

A comparative analysis of text classification for Turkish language


Abstract

Text categorization plays an important role in the field of Natural Language
Processing (NLP). Recently, the rapid growth in the amount of textual data and
the need for automatic annotation have made the problem of text categorization
more important. Prominent among the traditional methods, the bag-of-words
approach has been applied successfully to text categorization for years.
Recently, Neural Network Language Models (NNLM) have achieved successful
results on various NLP problems. The most important advantage of NNLMs is that
they provide effective word and document representations. Those representations
are lower dimensional and have been found to be more effective than those of
traditional methods. They have been exploited successfully for semantic and
syntactic analysis. On the other hand, traditional bag-of-words approaches,
which use long one-hot vector representations, are still considered powerful in
terms of their accuracy in document classification. However, these approaches
have not previously been compared for the Turkish language. In this study, we
compared them in a variety of analyses. We observed that the traditional
bag-of-words representation, combined with effective feature selection and a
machine learning algorithm suited to it, performs comparably to new-generation
vector-based methods, namely word embeddings. We conducted various experiments
comparing these approaches and identified an effective text categorization
architecture for the Turkish language.
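
The abstract does not name the feature selection method used with the bag-of-words representation. As one common possibility, the sketch below scores terms with the chi-square statistic over term presence and class membership and keeps the top-scoring ones. The choice of chi-square, the toy documents and labels, and the function names `chi_square` and `select_features` are assumptions for illustration only.

```python
def chi_square(docs, labels, term, target):
    """Chi-square score of `term` for class `target` from a 2x2 table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1      # term present, target class
        elif has_term:
            b += 1      # term present, other class
        elif in_class:
            c += 1      # term absent, target class
        else:
            d += 1      # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A term that occurs in every document (or none) carries no signal.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all classes."""
    vocab = set().union(*docs)
    classes = set(labels)
    scores = {
        term: max(chi_square(docs, labels, term, cls) for cls in classes)
        for term in vocab
    }
    return sorted(vocab, key=lambda term: scores[term], reverse=True)[:k]


# Toy data, invented for illustration: each document is a set of tokens.
docs = [
    {"futbol", "gol", "ve"},
    {"lig", "gol", "ve"},
    {"banka", "faiz", "ve"},
    {"kredi", "faiz", "ve"},
]
labels = ["spor", "spor", "ekonomi", "ekonomi"]

selected = select_features(docs, labels, k=2)
print(sorted(selected))
```

In this toy data the word "ve" appears in every document and scores zero, while "gol" and "faiz" separate the two classes perfectly and are selected. Shrinking the vocabulary this way reduces the dimensionality of the bag-of-words vectors before a classifier is trained.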

Keywords

Text classification, Machine learning, Artificial neural network


There are 22 references in total.

Details

Primary Language	Turkish
Subjects	Engineering
Section	Research Article
Authors	Savaş Yıldırım 0000-0002-7764-2891 Tuğba Yıldız 0000-0002-5868-5407
Publication Date	12 October 2018
Published in Issue	Year 2018 Volume: 24 Issue: 5

Cite

APA	Yıldırım, S., & Yıldız, T. (2018). Türkçe için karşılaştırmalı metin sınıflandırma analizi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, 24(5), 879-886.
AMA	Yıldırım S, Yıldız T. Türkçe için karşılaştırmalı metin sınıflandırma analizi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. October 2018;24(5):879-886.
Chicago	Yıldırım, Savaş, and Tuğba Yıldız. “Türkçe için karşılaştırmalı metin sınıflandırma analizi”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 24, no. 5 (October 2018): 879-86.
EndNote	Yıldırım S, Yıldız T (01 October 2018) Türkçe için karşılaştırmalı metin sınıflandırma analizi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 24 5 879–886.
IEEE	S. Yıldırım and T. Yıldız, “Türkçe için karşılaştırmalı metin sınıflandırma analizi”, Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 24, no. 5, pp. 879–886, 2018.
ISNAD	Yıldırım, Savaş - Yıldız, Tuğba. “Türkçe için karşılaştırmalı metin sınıflandırma analizi”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 24/5 (October 2018), 879-886.
JAMA	Yıldırım S, Yıldız T. Türkçe için karşılaştırmalı metin sınıflandırma analizi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2018;24:879–886.
MLA	Yıldırım, Savaş and Tuğba Yıldız. “Türkçe için karşılaştırmalı metin sınıflandırma analizi”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 24, no. 5, 2018, pp. 879-86.
Vancouver	Yıldırım S, Yıldız T. Türkçe için karşılaştırmalı metin sınıflandırma analizi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2018;24(5):879-86.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.