Research Article

Performance comparison of data balancing techniques on hate speech detection in Turkish

Volume: 30 Issue: 5, 30 October 2024

Abstract

The rise of hate speech on social media platforms can cause psychological harm and other deep negative effects. Automatic language classification models are needed to detect it. A common problem when evaluating such models is dataset imbalance, where one class is represented much more frequently than the other. When the dataset is imbalanced, the classifier may be biased towards the majority class and perform poorly on the minority class, which can lead to incorrect or unreliable classification results. To address this problem, data-level balancing methods such as oversampling or undersampling are used to balance the class distribution before the dataset is classified. This study aims to find a successful combination of data-level balancing method and classification model for detecting hate speech. To this end, a comprehensive study was carried out by applying eight data-level balancing methods (random oversampling, Synthetic Minority Oversampling Technique (SMOTE), K-means SMOTE, Localized Random Affine Shadow Sample (LoRAS), Text-based Generative Adversarial Network (TextGAN), NearMiss, Tomek Links, and clustering-based undersampling) to the Abusive Turkish Comments (ATC) dataset, which was obtained from Instagram and has an imbalanced label distribution. The classification performance of the data-level balancing methods was evaluated with Basic Machine Learning (BML) and Convolutional Neural Network (CNN) methods. The CBoW+CNN model based on the TextGAN data-level balancing method, as well as the Skip-gram+CNN model, exhibited the best classification performance, with a macro-averaged F1 score of 0.972.
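
To make the ideas of data-level balancing and macro-averaged F1 concrete, the following is a minimal sketch of random oversampling, the simplest of the eight methods listed in the abstract, together with a hand-written macro-averaged F1 score. It is not the authors' implementation: the function names random_oversample and macro_f1, the toy comments, and the label names "clean" and "abusive" are all assumptions made for illustration, and none of the data comes from the ATC dataset.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up for illustration): 8 "clean" vs 2 "abusive" comments.
texts = [f"comment {i}" for i in range(10)]
labels = ["clean"] * 8 + ["abusive"] * 2

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))  # both classes now have 8 samples

# A classifier that always predicts the majority class is 80% accurate
# here, but macro-averaged F1 exposes the ignored minority class.
always_clean = ["clean"] * len(labels)
print(round(macro_f1(labels, always_clean), 3))  # 0.444
```

In practice, balancing of this kind is normally applied to the training split only, so that the test set keeps the original class distribution. Macro-averaging weights both classes equally, which is why it is a common choice for imbalanced problems like the one described above.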

Keywords

Details

Primary Language

English

Subjects

Algorithms and Theory of Computation, Electrical Engineering (Other)

Section

Research Article

Publication Date

30 October 2024

Submission Date

13 March 2023

Acceptance Date

10 October 2023

Published in Issue

Year 2024 Volume: 30 Issue: 5

Cite

APA
Karayiğit, H., Akdağli, A., & Acı, Ç. İ. (2024). Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, 30(5), 610-621. https://izlik.org/JA87YW64ZT