TF-IDF, Word2vec ve Fasttext Vektör Model Yöntemleri ile Türkçe Haber Metinlerinin Sınıflandırılması

Özer Çelik; Burak Can Koç

doi:10.21205/deufmd.2021236710

Research Article

TF-IDF, Word2vec ve Fasttext Vektör Model Yöntemleri ile Türkçe Haber Metinlerinin Sınıflandırılması

Year 2021, Volume: 23 Issue: 67 , 121 - 127 , 15.01.2021

Özer Çelik , Burak Can Koç

https://doi.org/10.21205/deufmd.2021236710

Cited By: 12

https://izlik.org/JA72CN25LS

Abstract

Bilgisayar ve internetin hayatımıza girmesi ile bilgiye erişmek daha kolay hale gelmiştir. İnternete ulaşımın kolaylaşması ve internet kullanıcılarının artması sonucu veri miktarı da her geçen saniye büyümektedir. Ancak doğru bilgiye erişebilmek için verilerin sınıflandırılması gereklidir. Sınıflandırma, verilerin belirli bir anlamsal kategoriye göre ayrılması işlemidir. Dijital belgelerin anlamsal kategorilere ayrılması, metnin ulaşılabilirliğini önemli ölçüde etkilemektedir. Bu çalışmada, farklı Türkçe haber kaynaklarından elde edilen veri kümesi üzerinde metin sınıflandırma çalışması yapılmıştır. Öncelikli olarak haber metinleri ön işlemeden geçirilmiş ve gövdelenmiştir. Ön işlemeden geçirilen metinler Tfidfvectorizer, Word2Vec ve FastText yöntemleri ile ayrı ayrı vektörize edildikten sonra Destek Vektör Makinesi (Support Vector Machine, SVM), Naive Bayes, Logistic Regression, Random Forest ve Yapay Sinir Ağı (Artificial Neural Network, ANN) yöntemleri ile sınıflandırılmıştır. Yapılan çalışma sonucuna göre en yüksek başarı oranı %95,75 ile FastText yöntemi ve vektör modeli ile elde edilen metnin SVM ile sınıflandırılmasından elde edilmiştir.

Keywords

Metin Sınıflandırma , Türkçe Haber , TF-IDF , Word2Vec , Fasttext

References

Vapnik V. The nature of statistical learning theory. Springer, 2nd edition, 2013; New York, USA. pp: 32-40.
Joachims, T. (1999, June). Transductive inference for text classification using support vector machines. In Icml (Vol. 99, pp. 200-209).
Khan, Aurangzeb, et al. "A review of machine learning algorithms for text-documents classification." Journal of advances in information technology 1.1 (2010): 4-20.
Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015, February). Topical word embeddings. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Ramos, J. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning 2003; (Vol. 242, pp. 133-142).
Mikolov T, Chen K, Corrado G, Dean J. (2013), “Efficient estimation of word representations in vector space”. Proceedings of Workshop at ICLR. Scottsdale, Arizona 2-4 Mayıs 2013.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759
https://www.researchgate.net/publication/310799885_Generalized_Confusion_Matrix_for_Multiple_Classes
Amasyali, M. F., & Yildirim, T. (2004, April). Automatic text categorization of news articles. In Proceedings of the IEEE 12th Signal Processing and Communications Applications Conference, 2004. (pp. 224-226). IEEE.
Tüfekci, P., Uzun, E., & Sevinç, B. (2012, April). Text classification of web based news articles by using Turkish grammatical features. In 2012 20th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
Sen, M. U., & Yanıkoğlu, B. (2018, May). Document classification of SuDer Turkish news corpora. In 2018 26th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
Acı, Ç. İ., & Çırak, A. Türkçe Haber Metinlerinin Konvolüsyonel Sinir Ağları ve Word2Vec Kullanılarak Sınıflandırılması. Bilişim Teknolojileri Dergisi, 12(3), 219-228.
Erdinҫ, H. Y., & Güran, A. (2019, April). Semi-supervised Turkish Text Categorization with Word2Vec, Doc2Vec and FastText Algorithms. In 2019 27th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.

Classification of Turkish News Text by TF-IDF, Word2vec And Fasttext Vector Model Methods

Year 2021, Volume: 23 Issue: 67 , 121 - 127 , 15.01.2021

Özer Çelik , Burak Can Koç

https://doi.org/10.21205/deufmd.2021236710

Cited By: 12

https://izlik.org/JA72CN25LS

Abstract

Accessing information has become very simple with computers and internet. As the internet access is easier and the internet users increase, the amount of data is growing every second. However, in order to access correct information, data must be classified. Classification is the process of separating data according to a certain semantic category. Dividing digital documents into semantic categories significantly affects the availability of the text. In this study, a text classification study was carried out on a data set obtained from different Turkish news sources. After the pre-processed texts are separately vectorized with Tfidfvectorizer, Word2Vec and FastText methods, they are classified with Support Vector Machine (SVM), Naive Bayes, Logistic Regression, Random Forest and Artificial Neural Network (ANN) methods. According to the results of the study, the highest success rate was obtained from the classification of the text gained with FastText method and vector model with 95.75% by SVM.

Keywords

Text Classification , Turkish News , TF-IDF , Word2Vec , Fasttext

References

Vapnik V. The nature of statistical learning theory. Springer, 2nd edition, 2013; New York, USA. pp: 32-40.
Joachims, T. (1999, June). Transductive inference for text classification using support vector machines. In Icml (Vol. 99, pp. 200-209).
Khan, Aurangzeb, et al. "A review of machine learning algorithms for text-documents classification." Journal of advances in information technology 1.1 (2010): 4-20.
Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015, February). Topical word embeddings. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Ramos, J. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning 2003; (Vol. 242, pp. 133-142).
Mikolov T, Chen K, Corrado G, Dean J. (2013), “Efficient estimation of word representations in vector space”. Proceedings of Workshop at ICLR. Scottsdale, Arizona 2-4 Mayıs 2013.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759
https://www.researchgate.net/publication/310799885_Generalized_Confusion_Matrix_for_Multiple_Classes
Amasyali, M. F., & Yildirim, T. (2004, April). Automatic text categorization of news articles. In Proceedings of the IEEE 12th Signal Processing and Communications Applications Conference, 2004. (pp. 224-226). IEEE.
Tüfekci, P., Uzun, E., & Sevinç, B. (2012, April). Text classification of web based news articles by using Turkish grammatical features. In 2012 20th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
Sen, M. U., & Yanıkoğlu, B. (2018, May). Document classification of SuDer Turkish news corpora. In 2018 26th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
Acı, Ç. İ., & Çırak, A. Türkçe Haber Metinlerinin Konvolüsyonel Sinir Ağları ve Word2Vec Kullanılarak Sınıflandırılması. Bilişim Teknolojileri Dergisi, 12(3), 219-228.
Erdinҫ, H. Y., & Güran, A. (2019, April). Semi-supervised Turkish Text Categorization with Word2Vec, Doc2Vec and FastText Algorithms. In 2019 27th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.

There are 13 citations in total.

Details

Primary Language	Turkish
Subjects	Engineering
Journal Section	Research Article
Authors	Özer Çelik 0000-0002-4409-3101 Burak Can Koç 0000-0003-1831-4182
Publication Date	January 15, 2021
DOI	https://doi.org/10.21205/deufmd.2021236710
IZ	https://izlik.org/JA72CN25LS
Published in Issue	Year 2021 Volume: 23 Issue: 67

Cite

Vancouver	1.Özer Çelik, Burak Can Koç. TF-IDF, Word2vec ve Fasttext Vektör Model Yöntemleri ile Türkçe Haber Metinlerinin Sınıflandırılması. DEUFMD. 2021 Jan. 1;23(67):121-7. doi:10.21205/deufmd.2021236710

Cited By

News image text classification algorithm with bidirectional encoder representations from transformers model

Journal of Electronic Imaging

https://doi.org/10.1117/1.JEI.32.1.011217

Türkçe Sosyal Medya Yorumlarındaki Siber Zorbalığın Derin Öğrenme ile Tespiti

European Journal of Science and Technology

https://doi.org/10.31590/ejosat.987259

A Deep Learning Model for the Normalization of Institution Names by Multi-Source Literature Feature Fusion: Algorithm Development (Preprint)

JMIR Formative Research

https://doi.org/10.2196/47434

Çizgeler Üzerinde Farklı Ağırlıklandırma Yöntemleri Ve Merkezilik Ölçütleri İle Çıkarımsal Metin Özetleme

Fırat Üniversitesi Mühendislik Bilimleri Dergisi

https://doi.org/10.35234/fumbd.1155617

Personalized News Recommendation System

İstanbul Ticaret Üniversitesi Teknoloji ve Uygulamalı Bilimler Dergisi

https://doi.org/10.56809/icujtas.1193993

Classification of News Texts from Different Languages with Machine Learning Algorithms

Journal of Soft Computing and Artificial Intelligence

https://doi.org/10.55195/jscai.1311380

Fusion of the word2vec word embedding model and cluster analysis for the communication of music intangible cultural heritage

Scientific Reports

https://doi.org/10.1038/s41598-023-49619-8

Graf Sinir Ağları ile İlişkisel Türkçe Metin Sınıflandırma

Journal of Polytechnic

https://doi.org/10.2339/politeknik.1423293

An Analysis of Intelligent Turkish Text Classification Models for Routing Calls in Call Centers: A Case Study on the Republic of Turkiye Ministry of Trade Call Center

Sakarya University Journal of Computer and Information Sciences

https://doi.org/10.35377/saucis...1402414

Sentiment Analysis and Automatic Response Generation for E-Commerce Comments

Natural and Applied Sciences Journal

https://doi.org/10.38061/idunas.1596100

Classification of Brain MRI Images using Deep Learning: The DeiT3 Model and the Use of Feature Fusion Methods

Computers and Electronics in Medicine

https://doi.org/10.69882/adba.cem.2026019

OHE-RWNet: a novel neural network-based vectorization method for fake review detection

International Journal of Machine Learning and Cybernetics

https://doi.org/10.1007/s13042-025-02847-y

Article Files

Full Text

This journal is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

download?token=eyJhdXRoX3JvbGVzIjpbXSwiZW5kcG9pbnQiOiJmaWxlIiwicGF0aCI6IjliNTAvMDBjMi8xZmIxLzY5MjZmZDIyOGE1NzgyLjA3MzU5MTk2LnBuZyIsImV4cCI6MTc2NDE2OTMzMSwibm9uY2UiOiI2MTU1ODg1NGZlYzhkZTA1OThkNTU2NGFmYTQzYTc0YiJ9.O5b4Ex8bMlFv5797LL8VnE9YWS_X5880dfbmOp2-kc8