TF-IDF, Word2vec ve Fasttext Vektör Model Yöntemleri ile Türkçe Haber Metinlerinin Sınıflandırılması

Özer Çelik; Burak Can Koç

doi:10.21205/deufmd.2021236710

Research Article

Year 2021, Volume: 23 Issue: 67, 121 - 127, 15.01.2021

Cited By: 9

Abstract

Bilgisayar ve internetin hayatımıza girmesi ile bilgiye erişmek daha kolay hale gelmiştir. İnternete ulaşımın kolaylaşması ve internet kullanıcılarının artması sonucu veri miktarı da her geçen saniye büyümektedir. Ancak doğru bilgiye erişebilmek için verilerin sınıflandırılması gereklidir. Sınıflandırma, verilerin belirli bir anlamsal kategoriye göre ayrılması işlemidir. Dijital belgelerin anlamsal kategorilere ayrılması, metnin ulaşılabilirliğini önemli ölçüde etkilemektedir. Bu çalışmada, farklı Türkçe haber kaynaklarından elde edilen veri kümesi üzerinde metin sınıflandırma çalışması yapılmıştır. Öncelikli olarak haber metinleri ön işlemeden geçirilmiş ve gövdelenmiştir. Ön işlemeden geçirilen metinler Tfidfvectorizer, Word2Vec ve FastText yöntemleri ile ayrı ayrı vektörize edildikten sonra Destek Vektör Makinesi (Support Vector Machine, SVM), Naive Bayes, Logistic Regression, Random Forest ve Yapay Sinir Ağı (Artificial Neural Network, ANN) yöntemleri ile sınıflandırılmıştır. Yapılan çalışma sonucuna göre en yüksek başarı oranı %95,75 ile FastText yöntemi ve vektör modeli ile elde edilen metnin SVM ile sınıflandırılmasından elde edilmiştir.

Keywords

Metin Sınıflandırma, Türkçe Haber, TF-IDF, Word2Vec, Fasttext

Classification of Turkish News Text by TF-IDF, Word2vec And Fasttext Vector Model Methods

Abstract

Computers and the internet have made information far easier to access. As internet access becomes easier and the number of internet users grows, the amount of data increases every second. To reach accurate information, however, data must be classified. Classification is the process of separating data according to a certain semantic category. Dividing digital documents into semantic categories significantly affects the accessibility of the text. In this study, text classification was carried out on a data set obtained from different Turkish news sources. The news texts were first pre-processed and stemmed. The pre-processed texts were then vectorized separately with the Tfidfvectorizer, Word2Vec and FastText methods and classified with Support Vector Machine (SVM), Naive Bayes, Logistic Regression, Random Forest and Artificial Neural Network (ANN) methods. According to the results of the study, the highest success rate, 95.75%, was obtained by classifying texts vectorized with the FastText method and vector model using SVM.
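The vectorize-then-classify pipeline summarized in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. It assumes a smoothed IDF of ln((1 + N) / (1 + df)) + 1 with L2 normalisation, and it swaps in a simple nearest-centroid (cosine similarity) classifier as a stand-in for the SVM and the other classifiers named above. The toy corpus, the labels `spor` and `ekonomi`, and every function name are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-stemmed tokens; NOT the paper's data set.
TRAIN = [
    ("spor", "takım maç gol kazan"),
    ("spor", "maç hakem gol takım"),
    ("ekonomi", "borsa dolar faiz art"),
    ("ekonomi", "faiz banka dolar kur"),
]


def fit_idf(token_lists):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def fit_centroids(labelled_vectors):
    """Sum the vectors of each class (stand-in classifier, not an SVM)."""
    centroids = {}
    for label, vec in labelled_vectors:
        acc = centroids.setdefault(label, Counter())
        for t, w in vec.items():
            acc[t] += w
    return centroids


def predict(text, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(text.split(), idf)

    def score(label):
        c = centroids[label]
        norm = math.sqrt(sum(w * w for w in c.values()))
        return dot(vec, c) / norm if norm else 0.0

    return max(centroids, key=score)


token_lists = [text.split() for _, text in TRAIN]
idf = fit_idf(token_lists)
centroids = fit_centroids(
    [(label, tfidf(tokens, idf)) for (label, _), tokens in zip(TRAIN, token_lists)]
)

print(predict("takım gol at", idf, centroids))    # spor
print(predict("dolar faiz düş", idf, centroids))  # ekonomi
```

Terms that never appeared in training (`at`, `düş`) are simply ignored by this TF-IDF sketch. Subword-based embeddings such as FastText are designed to cope with such unseen word forms, which may matter for a morphologically rich language like Turkish, although the abstract itself does not analyse why FastText scored highest.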

Keywords

Text Classification, Turkish News, TF-IDF, Word2Vec, Fasttext

Details

Primary Language	Turkish
Subjects	Engineering
Journal Section	Research Article
Authors	Özer Çelik (ORCID: 0000-0002-4409-3101); Burak Can Koç (ORCID: 0000-0003-1831-4182)
Publication Date	January 15, 2021
Published in Issue	Year 2021 Volume: 23 Issue: 67

Cite

APA	Çelik, Ö., & Koç, B. C. (2021). TF-IDF, Word2vec ve Fasttext Vektör Model Yöntemleri ile Türkçe Haber Metinlerinin Sınıflandırılması. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen Ve Mühendislik Dergisi, 23(67), 121-127. https://doi.org/10.21205/deufmd.2021236710