Research Article

Classification of News Texts from Different Languages with Machine Learning Algorithms

Year 2023, Volume: 4 Issue: 1, 29 - 37, 25.06.2023
https://doi.org/10.55195/jscai.1311380

Abstract

With advances in technology, the internet has become one of the most important sources of information. Although it makes large amounts of data accessible in a short time, that data is only useful if it is analyzed correctly. As the volume of unstructured text in digital environments grows, so does the need for text mining to process, analyze, and meaningfully classify it. In this study, news texts collected from online German, Spanish, English, and Turkish news sites were assigned to four predetermined categories: world, sports, economy, and politics. The data set of 4000 news texts was classified using 41 different machine learning algorithms in Weka. The highest classification success was obtained with the Naive Bayes Multinomial and Naive Bayes Multinomial Updateable algorithms: 93.5% for German, 93.3% for English, 82.8% for Spanish, and 88.8% for Turkish news texts.
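To make the best-performing approach concrete, the following is a minimal pure-Python sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is not the Weka implementation used in the study: the class name `MultinomialNaiveBayes`, the whitespace tokenizer, and the eight toy headlines are all illustrative assumptions and do not come from the 4000-text data set.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class MultinomialNaiveBayes:
    """Illustrative multinomial Naive Bayes; not Weka's implementation."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace (add-alpha) smoothing constant

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(token | class) for each class."""
        scores = {}
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / self.total_docs)
            for tok in tokens:
                score += math.log(
                    (counts[tok] + self.alpha) / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy headlines, two per category (not taken from the study's data).
train_texts = [
    "the team won the match in the final minute",
    "the striker scored two goals in the league game",
    "the central bank raised interest rates again",
    "stock markets fell as inflation data surprised investors",
    "parliament voted on the new election law",
    "the president met opposition party leaders",
    "earthquake relief efforts continue across the region",
    "united nations envoy visits the border region",
]
train_labels = [
    "sports", "sports", "economy", "economy",
    "politics", "politics", "world", "world",
]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
print(model.predict("the bank cut interest rates"))        # economy
print(model.predict("the team scored in the final game"))  # sports
```

The classifier scores each category by adding the log prior to the smoothed log likelihood of every known token, then picks the highest score. A real experiment would need proper tokenization for each language and a held-out evaluation, neither of which this sketch attempts.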

Supporting Institution

None.

Project Number

None.

References

  • Başkaya, F., & Aydın, İ. Haber Metinlerinin Farklı Metin Madenciliği Yöntemleriyle Sınıflandırılması, In 2017 International Artificial Intelligence and Data Processing Symposium (IDAP), 2017, pp. 1-5. IEEE.
  • Aydemir, E., Işık, M. & Tuncer, T. Türkçe Haber Metinlerinin Çok Terimli Naive Bayes Algoritması Kullanılarak Sınıflandırılması, Fırat Üniversitesi Mühendislik Bilimleri Dergisi, 2021, 33(2), pp. 519-526. doi: 10.35234/fumbd.871986
  • Acı, Ç. & Çırak, A. Türkçe Haber Metinlerinin Konvolüsyonel Sinir Ağları ve Word2Vec Kullanılarak Sınıflandırılması, Bilişim Teknolojileri Dergisi, 2019, 12(3), pp. 219-228. doi: 10.17671/gazibtd.457917.
  • Uslu, O., & Akyol, S. Türkçe Haber Metinlerinin Makine Öğrenmesi Yöntemleri Kullanılarak Sınıflandırılması, ESTUDAM Bilişim Dergisi, 2019, 2(1), pp. 15-20.
  • Doğan, K., & Arslantekin, S. Büyük Veri: Önemi, Yapısı Ve Günümüzdeki Durum, Ankara Üniversitesi Dil ve Tarih-Coğrafya Fakültesi Dergisi, 2016, 56(1), pp. 15-36.
  • Bach, M. P., Krstić, Ž., Seljan, S., & Turulja, L. Text mining for big data analysis in financial sector: A literature review, Sustainability, 2019, 11(5), pp. 1-27.
  • Tan, A. H. Text mining: The state of the art and the challenges, In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, 1999, pp. 65-70.
  • Coşkun, C., & Baykal, A. Veri Madenciliğinde Sınıflandırma Algoritmalarının Bir Örnek Üzerinde Karşılaştırılması. Akademik Bilişim, 2011, 11, pp. 51-58.
  • Dalal, M. K., & Zaveri, M. A. Automatic Text Classification: A Technical Review, International Journal of Computer Applications, 2011, 28(2), pp. 37-40.
  • Çelik, Ö., & Koç, B. C. TF-IDF, Word2vec ve Fasttext Vektör Model Yöntemleri ile Türkçe Haber Metinlerinin Sınıflandırılması, Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen ve Mühendislik Dergisi, 2021, 23(67), pp. 121-127.
  • Tantuğ, A. C. Metin Sınıflandırma, Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi, 2016, 5(2).
  • Toraman, C., Can, F., & Koçberber, S. Developing A Text Categorization Template For Turkish News Portals, In 2011 International Symposium on Innovations in Intelligent Systems and Applications, 2011, pp. 379-383. IEEE.
  • Yıldırım, S., & Yıldız, T. Türkçe İçin Karşılaştırmalı Metin Sınıflandırma Analizi, Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, 2018, 24(5), pp. 879-886.
  • Amasyalı, M. F., Diri, B., & Türkoğlu, F. Farklı Özellik Vektörleri İle Türkçe Dokümanların Yazarlarının Belirlenmesi, In The Fifteenth Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN'2006), 2006, pp. 4.
  • Cusmuliuc, C. G., Coca, L. G., & Iftene, A. Identifying Fake News on Twitter using Naive Bayes, SVM and Random Forest Distributed Algorithms, In Proceedings of The 13th Edition of the International Conference on Linguistic Resources and Tools for Processing Romanian Language, 2018, pp. 177-188.
  • Doğan, S., & Diri, B. Türkçe Dokümanlar için N-Gram Tabanlı Yeni Bir Sınıflandırma (Ng-İnd): Yazar, Tür ve Cinsiyet, Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi, 2010, 3(1), pp. 11-19.
  • Aşlıyan, R., & Günel, K. Metin İçerikli Türkçe Dokümanların Sınıflandırılması, Akademik Bilişim Konferansı, 2010, pp. 659-665.
  • Soucy, P., & Mineau, G. W. A Simple KNN Algorithm For Text Categorization, In Proceedings 2001 IEEE international conference on data mining, 2001, pp. 647-648. IEEE.
  • Joachims, T. Text Categorization With Support Vector Machines: Learning With Many Relevant Features, In European conference on machine learning, 1998, pp. 137-142.
  • Ma, L., Shepherd, J., & Zhang, Y. Enhancing Text Classification Using Synopses Extraction, In Proceedings of the Fourth International Conference on Web Information Systems Engineering, 2003, pp. 115-124. IEEE.
  • Lam, S. L., & Lee, D. L. Feature Reduction For Neural Network Based Text Categorization, In Proceedings. 6th international conference on advanced systems for advanced applications, 1999, pp. 195-202. IEEE.
  • Ng, H. T., Goh, W. B., & Low, K. L. Feature Selection, Perceptron Learning, And A Usability Case Study For Text Categorization, In Proceedings Of The 20th Annual International ACM SIGIR Conference On Research And Development In Information Retrieval, 1997, pp. 67-73.
  • Nakayama, M., & Shimizu, Y. Subject Categorization for Web Educational Resources using MLP, In ESANN, 2003, pp. 9-14.
  • Srinivasan, P., & Ruiz, M. E. Automatic Text Categorization Using Neural Network, In Proceedings of the 8th ASIS SIG/CR Workshop on Classification Research, 1998, pp. 59-72.
  • Ma, S., & Ji, C. A Unified Approach on Fast Training of Feedforward and Recurrent Networks Using EM Algorithm, IEEE transactions on signal processing, 1998, 46(8), pp. 2270-2274. IEEE.
  • Şimşek, H. & Aydemir, E. Classification of Unwanted E-Mails (Spam) with Turkish Text by Different Algorithms in Weka Program, Journal of Soft Computing and Artificial Intelligence, 2022, 3(1), pp. 1-10. doi: 10.55195/jscai.1104694
  • Dilrukshi, I., De Zoysa, K., & Caldera, A. Twitter News Classification Using SVM, In 2013 8th International Conference on Computer Science & Education, 2013, pp. 287-291. IEEE.
  • Deniz, E., Erbay, H., & Coşar, M. Classification Of Turkish E-Mails With Doc2Vec, In 2019 1st International Informatics and Software Engineering Conference (UBMYK), 2019, pp. 1-4. IEEE.
  • Sel, İ., Karci, A., & Hanbay, D. Feature Selection for Text Classification Using Mutual Information, In 2019 International Artificial Intelligence and Data Processing Symposium (IDAP), 2019, pp. 1-4. IEEE.
  • Jehad, R., & Yousif, S. A. Fake News Classification Using Random Forest and Decision Tree (J48), Al-Nahrain Journal of Science, 2020, 23(4), pp. 49-55.
  • Shahi, T. B., & Pant, A. K. Nepali News Classification Using Naïve Bayes, Support Vector Machines and Neural Networks, In 2018 International Conference on Communication Information and Computing Technology (ICCICT), 2018, pp. 1-5. IEEE.
  • Aydemir, E. Weka İle Yapay Zeka. Seçkin Yayınevi, 2018, Ankara.
  • Ağduk, S., Aydemir, E. & Polat, A. (2022). News Texts by Category in Different Languages [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/3572093
There are 33 citations in total.

Details

Primary Language: English
Subjects: Computer Software
Journal Section: Research Articles
Authors:

Sidar Ağduk (ORCID: 0000-0002-2927-0077)

Emrah Aydemir (ORCID: 0000-0002-8380-7891)

Ayfer Polat (ORCID: 0000-0002-3838-173X)

Project Number: None.
Early Pub Date: June 30, 2023
Publication Date: June 25, 2023
Submission Date: June 8, 2023
Published in Issue: Year 2023, Volume: 4, Issue: 1

Cite

APA Ağduk, S., Aydemir, E., & Polat, A. (2023). Classification of News Texts from Different Languages with Machine Learning Algorithms. Journal of Soft Computing and Artificial Intelligence, 4(1), 29-37. https://doi.org/10.55195/jscai.1311380
AMA Ağduk S, Aydemir E, Polat A. Classification of News Texts from Different Languages with Machine Learning Algorithms. JSCAI. June 2023;4(1):29-37. doi:10.55195/jscai.1311380
Chicago Ağduk, Sidar, Emrah Aydemir, and Ayfer Polat. “Classification of News Texts from Different Languages With Machine Learning Algorithms”. Journal of Soft Computing and Artificial Intelligence 4, no. 1 (June 2023): 29-37. https://doi.org/10.55195/jscai.1311380.
EndNote Ağduk S, Aydemir E, Polat A (June 1, 2023) Classification of News Texts from Different Languages with Machine Learning Algorithms. Journal of Soft Computing and Artificial Intelligence 4 1 29–37.
IEEE S. Ağduk, E. Aydemir, and A. Polat, “Classification of News Texts from Different Languages with Machine Learning Algorithms”, JSCAI, vol. 4, no. 1, pp. 29–37, 2023, doi: 10.55195/jscai.1311380.
ISNAD Ağduk, Sidar et al. “Classification of News Texts from Different Languages With Machine Learning Algorithms”. Journal of Soft Computing and Artificial Intelligence 4/1 (June 2023), 29-37. https://doi.org/10.55195/jscai.1311380.
JAMA Ağduk S, Aydemir E, Polat A. Classification of News Texts from Different Languages with Machine Learning Algorithms. JSCAI. 2023;4:29–37.
MLA Ağduk, Sidar et al. “Classification of News Texts from Different Languages With Machine Learning Algorithms”. Journal of Soft Computing and Artificial Intelligence, vol. 4, no. 1, 2023, pp. 29-37, doi:10.55195/jscai.1311380.
Vancouver Ağduk S, Aydemir E, Polat A. Classification of News Texts from Different Languages with Machine Learning Algorithms. JSCAI. 2023;4(1):29-37.