A study on effective feature extraction and genetic algorithm based feature selection method in fake news detection classification using machine learning approaches

Ramazan İncir; Mete Yağanoğlu; Ferhat Bozkurt

doi:10.17714/gumusfenbil.1396652

Research Article

Year 2024, Volume: 14, Issue: 3, 764 - 776, 15.09.2024

Abstract

Information now spreads quickly through online social networks, which makes daily life easier. However, when fake news is shared without critical evaluation, it easily reaches a wide audience and can harm society, with social, political and economic consequences. Developing content verification and fact-checking systems is therefore important. This study aims to perform monolingual and cross-lingual classification on a multi-class dataset containing English and German news content. Before classification, data preprocessing was applied, including CountVectorizer and stylometric feature extraction. Feature selection was performed with a genetic algorithm, an algorithm based on the idea of evolution in nature. The selected features were classified with the Random Forest, Logistic Regression, Multinomial Naive Bayes, Decision Tree and K-Nearest Neighbors machine learning algorithms. For monolingual English news texts, Multinomial Naive Bayes achieved 58.49% accuracy and 42.97% macro-F1, while in cross-lingual classification using English and German news texts, Logistic Regression achieved 45.39% accuracy and 37.70% macro-F1. Quite successful results were obtained compared with other studies conducted on the same dataset. In addition, the same methodology was applied to the ISOT dataset, where Logistic Regression and Decision Tree achieved 99.48% and 99.62% macro-F1, respectively.
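To make the pipeline described in the abstract more concrete, the following is a minimal, standard-library-only sketch of genetic-algorithm feature selection over bag-of-words counts, scored by macro-F1. It is an illustration under stated assumptions, not the authors' implementation: the toy documents and labels, the nearest-centroid stand-in classifier, the feature-count penalty, and all GA settings (population size, generations, mutation rate, tournament size) are hypothetical. Stylometric features are omitted, and fitness is computed on the training data only for brevity; a real experiment would score each mask on held-out data.

```python
import random
from collections import Counter

def count_vectorize(docs):
    """Bag-of-words counts; returns (sorted vocabulary, per-document count rows)."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    rows = []
    for d in docs:
        counts = Counter(d.lower().split())
        rows.append([counts[t] for t in vocab])
    return vocab, rows

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

def centroid_predict(X, y, mask):
    """Nearest-centroid classifier (assumed stand-in) using only masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    centroids = {}
    for c in sorted(set(y)):
        members = [row for row, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[j] for r in members) / len(members) for j in cols]
    def dist(row, cen):
        return sum((row[j] - v) ** 2 for j, v in zip(cols, cen))
    return [min(centroids, key=lambda c: dist(row, centroids[c])) for row in X]

def fitness(mask, X, y, penalty=0.01):
    """Macro-F1 of the masked classifier minus a small penalty on mask size."""
    if not any(mask):
        return 0.0
    return macro_f1(y, centroid_predict(X, y, mask)) - penalty * sum(mask) / len(mask)

def genetic_select(X, y, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Evolve binary feature masks: elitism, tournament selection,
    one-point crossover, and bit-flip mutation."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, X, y), reverse=True)
        next_pop = ranked[:2]  # keep the two best masks unchanged
        while len(next_pop) < pop_size:
            p1 = max(rng.sample(pop, 3), key=lambda m: fitness(m, X, y))
            p2 = max(rng.sample(pop, 3), key=lambda m: fitness(m, X, y))
            cut = rng.randrange(1, n)
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda m: fitness(m, X, y))

# Hypothetical toy corpus, not taken from the datasets used in the study.
docs = [
    "shocking secret cure doctors hate",
    "miracle cure shocking truth revealed",
    "parliament passes budget bill today",
    "central bank holds interest rate",
    "shocking miracle secret revealed today",
    "government publishes budget report",
]
labels = ["false", "false", "true", "true", "false", "true"]

vocab, X = count_vectorize(docs)
best = genetic_select(X, labels)
selected = [word for word, keep in zip(vocab, best) if keep]
print(selected, round(fitness(best, X, labels), 3))
```

Each chromosome is a 0/1 mask over the vocabulary, so the selected feature subset can be read directly from the best mask. In a setting closer to the study, the stand-in classifier would be replaced by one of the listed algorithms and the count matrix would be extended with stylometric features.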

Keywords

Cross-lingual classification, Fake news detection, Genetic algorithm, Machine learning, Monolingual classification

Details

Primary Language	English
Subjects	Natural Language Processing
Section	Articles
Authors	Ramazan İncir 0000-0002-7869-9945 Mete Yağanoğlu 0000-0003-3045-169X Ferhat Bozkurt 0000-0003-0088-5825
Publication Date	September 15, 2024
Submission Date	November 27, 2023
Acceptance Date	May 6, 2024
Published in Issue	Year 2024 Volume: 14 Issue: 3

Cite

APA	İncir, R., Yağanoğlu, M., & Bozkurt, F. (2024). A study on effective feature extraction and genetic algorithm based feature selection method in fake news detection classification using machine learning approaches. Gümüşhane Üniversitesi Fen Bilimleri Dergisi, 14(3), 764-776. https://doi.org/10.17714/gumusfenbil.1396652
