TF-IDF ve Doc2Vec Tabanlı Türkçe Metin Sınıflandırma Sisteminin Başarım Değerinin Ardışık Kelime Grubu Tespiti ile Arttırılması

Doğancan Kınık; Aysun Güran

doi:10.31590/ejosat.774144

Research Article

Enhancing the Performance of TF-IDF and Doc2Vec based Turkish Text Categorization System with Phrase Modeling

Year 2021, Issue: 21, 323 - 332, 31.01.2021

Cited By: 3

Abstract

The TF-IDF term weighting measure is based on the frequency of words in texts and does not capture the semantic relationships between words. Doc2Vec, which is based on artificial neural networks, can capture these semantic relationships and yields document vectors of a more manageable size. Many studies have reported that consecutive word detection has important effects on text mining, because consecutive word phrases help express the semantic integrity of a text. In this study, the performance of traditional machine learning classifiers and ensemble learning algorithms is compared on four Turkish datasets, containing news documents of different lengths, that are vectorized with both the traditional TF-IDF term weighting measure and the Doc2Vec method. The contributions of our study are to apply consecutive word detection to the documents before the classification phase and to show that the performance of the applied classifiers increases once this step is applied. In addition to an approach based on how often words occur together, we also use the URL links of Turkish Wikipedia for consecutive word detection. With consecutive word detection, higher performance values are obtained in almost all classification experiments.
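As a rough illustration of the consecutive word detection step, the sketch below merges adjacent word pairs into single tokens before any vectorization takes place. It is only a minimal sketch under stated assumptions: the scoring formula, the `min_count` and `threshold` values, the toy documents, and the `wiki_phrases` lexicon (a stand-in for phrases gathered from Turkish Wikipedia links) are all hypothetical and are not taken from the study.

```python
from collections import Counter

def find_phrases(docs, min_count=2, threshold=0.1):
    """Return adjacent word pairs that co-occur often enough to be merged."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        # Assumed score: pair count normalized by the two unigram counts.
        score = n_ab / (unigrams[a] * unigrams[b])
        if score >= threshold:
            phrases.add((a, b))
    return phrases

def merge_phrases(doc, phrases):
    """Rewrite a token list so that each detected pair becomes one token."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            merged.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged

# Toy tokenized documents (hypothetical, for illustration only).
docs = [
    "merkez bankası faiz kararını açıkladı".split(),
    "merkez bankası başkanı konuştu".split(),
    "şehir merkez trafiği yoğun".split(),
    "yapay zeka haberleri".split(),
]

# Hypothetical lexicon standing in for phrases taken from Wikipedia links.
wiki_phrases = {("yapay", "zeka")}

frequent = find_phrases(docs)
merged_docs = [merge_phrases(doc, frequent | wiki_phrases) for doc in docs]
```

In this toy corpus the pair "yapay zeka" occurs only once, so the frequency rule alone misses it and only the lexicon catches it. The merged token lists would then be passed to TF-IDF or Doc2Vec in place of the original tokens.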

Keywords

TF-IDF, Doc2Vec, Phrase detection, Ensemble learning, Text categorization

References

Mikolov T, Chen K, Corrado G, Dean J. (2013), “Efficient estimation of word representations in vector space”. Proceedings of Workshop at ICLR. Scottsdale, Arizona, 2-4 May 2013.
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1188–1196, Beijing, China.
Ay Karakuş, B., Talo, M., Hallaç, İ. R., & Aydin, G. (2018). Evaluating deep learning models for sentiment classification. Concurrency and Computation: Practice and Experience, e4783.
Karasoy, O., & Ballı, S. (2017, October). Classification Turkish SMS with deep learning tool Word2Vec. In Computer Science and Engineering (UBMK), 2017 International Conference on (pp. 294-297). IEEE.
Şahin, G. (2017, May). Turkish document classification based on Word2Vec and SVM classifier. In 2017 25th signal processing and communications applications conference (SIU) (pp. 1-4). IEEE.
Çelenli, H. İ., Öztürk, S. T., Şahin, G., Gerek, A., & Ganiz, M. C. (2018, September). Document Embedding Based Supervised Methods for Turkish Text Classification. In 2018 3rd International Conference on Computer Science and Engineering (UBMK) (pp. 477-482). IEEE.
Sarı, M., & Özbayoğlu, A. M. (2018, September). Classification of Turkish Documents Using Paragraph Vector. In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP) (pp. 1-5). IEEE.
Karcioğlu, A. A., & Aydin, T. (2019, April). Sentiment Analysis of Turkish and English Twitter Feeds Using Word2Vec Model. In 2019 27th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
Deniz, E., Erbay, H., & Coşar, M. (2019, November). Classification of Turkish E-Mails with Doc2Vec. In 2019 1st International Informatics and Software Engineering Conference (UBMYK) (pp. 1-4). IEEE.
Erşahin, B., Aktaş, Ö., Kilinc, D., & Erşahin, M. (2019). A hybrid sentiment analysis method for Turkish. Turkish Journal of Electrical Engineering & Computer Sciences, 27(3), 1780-1793.
Sel, İ., Karci, A., & Hanbay, D. (2019, September). Feature Selection for Text Classification Using Mutual Information. In 2019 International Artificial Intelligence and Data Processing Symposium (IDAP) (pp. 1-4). IEEE.
Erdinç, H. Y., & Güran, A. (2019, April). Semi-supervised Turkish Text Categorization with Word2Vec, Doc2Vec and FastText Algorithms. In 2019 27th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
Güler, G., & Tantuğ, A. C. (2020). Comparison of Turkish Word Representations Trained on Different Morphological Forms. arXiv preprint arXiv:2002.05417.
M. F. Amasyalı and A. Beken, “Türkçe kelimelerin anlamsal benzerliklerinin ölçülmesi ve metin sınıflandırmada kullanılması,” in IEEE signal processing and communications applications conference, Antalya, Turkey, 2009.
Torunoğlu, D., Çakirman, E., Ganiz, M. C., Akyokuş, S., & Gürbüz, M. Z. (2011, June). Analysis of preprocessing methods on classification of Turkish texts. In 2011 International Symposium on Innovations in Intelligent Systems and Applications (pp. 112-117). IEEE.
A. C. Tantug, “Document categorization with modified statistical language models for agglutinative languages,” International Journal of Computational Intelligence Systems, vol. 3, no. 5, pp. 632–645, 2010.

TF-IDF ve Doc2Vec Tabanlı Türkçe Metin Sınıflandırma Sisteminin Başarım Değerinin Ardışık Kelime Grubu Tespiti ile Arttırılması

Abstract

The TF-IDF term weighting measure is based on how frequently words occur in texts. This measure does not capture the semantic relationship between words. The Doc2Vec method, which is based on artificial neural networks, captures the semantic relationship between words and the documents that contain them, and yields document vectors of a manageable size. Many studies in the literature have reported the positive effects of consecutive word group detection on text mining. Detecting consecutive word groups is important for preserving semantic integrity within a document. In this study, the performance of base machine learning classifiers and ensemble learning algorithms is compared on documents vectorized with both the traditional TF-IDF term weighting measure and Doc2Vec, a method based on artificial neural networks. The base classifiers used are Naive Bayes, K-Nearest Neighbors, Logistic Regression, Support Vector Machines (“Karar Destek Makineleri” in the original), Decision Trees, and Multilayer Perceptrons; the ensemble learning methods used are Random Forest, Bagging, and AdaBoost. Finally, the three most successful classification algorithms are combined by majority voting and the results are reported. The classifiers are applied to 4 different Turkish datasets containing news documents of different lengths. The contribution of our study to the literature is to detect consecutive word groups in the documents before the classification phase and to show that, once the documents are vectorized with these phrases treated as single words, the performance of the applied classifiers increases.
Besides an approach based on how often words occur together, the word links of Turkish Wikipedia are also used for consecutive word group detection, which makes it possible to detect consecutive phrases that are meaningful even though they occur only a few times in the documents. With consecutive word group detection, higher performance values are obtained in almost all of the classification experiments.
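The majority voting step, in which the three best classifiers are combined, could look roughly like the following. This is a minimal sketch of plain hard voting under stated assumptions: the class labels, the three prediction lists, and the tie-breaking rule (a tie falls to the earliest-listed classifier) are all hypothetical, since the abstract does not specify them.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label list by hard voting."""
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Assumed tie-break: keep the vote of the earliest-listed classifier
        # among the labels that share the highest count.
        winner = next(label for label in votes if counts[label] == top)
        combined.append(winner)
    return combined

# Hypothetical predictions from three classifiers on five news documents.
clf_a = ["spor", "ekonomi", "spor", "siyaset", "ekonomi"]
clf_b = ["spor", "spor", "spor", "siyaset", "siyaset"]
clf_c = ["ekonomi", "ekonomi", "spor", "spor", "spor"]

final = majority_vote([clf_a, clf_b, clf_c])
```

Listing the classifiers from most to least accurate makes the assumed tie-break favor the strongest model, which is one reasonable choice when all three disagree.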

Keywords

TF-IDF, Doc2Vec, Consecutive word group detection, Ensemble learning, Text classification

There are 15 citations in total.

Details

Primary Language	Turkish
Subjects	Engineering
Journal Section	Articles
Authors	Doğancan Kınık 0000-0003-0207-7194 Aysun Güran 0000-0001-7066-0635
Publication Date	January 31, 2021
Published in Issue	Year 2021 Issue: 21

Cite

APA	Kınık, D., & Güran, A. (2021). TF-IDF ve Doc2Vec Tabanlı Türkçe Metin Sınıflandırma Sisteminin Başarım Değerinin Ardışık Kelime Grubu Tespiti ile Arttırılması. Avrupa Bilim Ve Teknoloji Dergisi(21), 323-332. https://doi.org/10.31590/ejosat.774144

Avrupa Bilim ve Teknoloji Dergisi

Cited By

Duygu Analizi İçin Veri Madenciliği Sınıflandırma Algoritmalarının Karşılaştırılması

European Journal of Science and Technology

https://doi.org/10.31590/ejosat.905259

Application of a Combined Approach of Text Mining and QFD Methodology Based on Single Valued Neutrosophic Numbers for Efficient Curriculum Design

Alphanumeric Journal

https://doi.org/10.17093/alphanumeric.1127620

Medya İçeriklerinde Kamu Diplomasisinin Metin Madenciliğiyle Analizi: Cumhurbaşkanlığı Millet Kütüphanesi Örneği

Turk Kutuphaneciligi - Turkish Librarianship

https://doi.org/10.24146/tk.1625263