Performance comparison of data balancing techniques on hate speech detection in Turkish
Abstract
The rise of hate speech on social media platforms causes psychological disorders and has deep negative effects. Automatic language classification models are needed to detect hate speech. When language models are tested for hate speech detection, imbalanced datasets, in which one class is represented far more frequently than the other, can be a problem. When the dataset is imbalanced, the classifier may be biased towards the majority class and may not perform well on the minority class. This can lead to incorrect or unreliable classification results. To solve this problem, data-level balancing methods such as oversampling or undersampling are used to balance the class distribution before the dataset is classified. This study aims to obtain a successful classification model combination that detects hate speech by using data-level balancing methods. To this end, a comprehensive study was carried out by applying eight data-level balancing methods (random oversampling, Synthetic Minority Oversampling Technique (SMOTE), K-means SMOTE, Localized Random Affine Shadow Sample (LoRAS), Text-based Generative Adversarial Network (TextGAN), NearMiss, Tomek Links, and Clustering-based) to the Abusive Turkish Comments (ATC) dataset, which was obtained from Instagram and has an imbalanced distribution of labels. The classification performance of the data-level balancing methods was evaluated with Basic Machine Learning (BML) and Convolutional Neural Network (CNN) methods. The CBoW+CNN model based on the TextGAN data-level balancing method, as well as the Skip-gram CNN model, exhibited the best classification performance, with a macro-averaged F1 score of 0.972.
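As a rough illustration of two ideas named in the abstract, the sketch below implements random oversampling (the simplest of the eight balancing methods) and the macro-averaged F1 score in plain Python. It is not the authors' implementation: the function names `random_oversample` and `macro_f1`, the `seed` parameter, and the toy labels are assumptions made only for this example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples that belong to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy, made-up data: 9 "clean" comments and 1 "abusive" comment.
    texts = [f"comment {i}" for i in range(10)]
    labels = ["clean"] * 9 + ["abusive"]

    balanced_texts, balanced_labels = random_oversample(texts, labels)
    print(Counter(balanced_labels))

    # A classifier that always predicts the majority class is 90% accurate
    # here, yet its macro-averaged F1 is low because the minority class
    # contributes an F1 of zero.
    always_clean = ["clean"] * 10
    print(macro_f1(labels, always_clean))
```

In practice, balancing of this kind is applied to the training split only, so that evaluation scores reflect the original class distribution.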
Details
Primary Language
English
Subjects
Algorithms and Theory of Computation, Electrical Engineering (Other)
Section
Research Article
Publication Date
30 October 2024
Submission Date
13 March 2023
Acceptance Date
10 October 2023
Published in Issue
Year 2024 Volume: 30 Issue: 5
APA
Karayiğit, H., Akdağli, A., & Acı, Ç. İ. (2024). Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, 30(5), 610-621. https://izlik.org/JA87YW64ZT
AMA
1. Karayiğit H, Akdağli A, Acı Çİ. Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2024;30(5):610-621. https://izlik.org/JA87YW64ZT
Chicago
Karayiğit, Habibe, Ali Akdağli, and Çiğdem İnan Acı. 2024. “Performance comparison of data balancing techniques on hate speech detection in Turkish”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 30 (5): 610-21. https://izlik.org/JA87YW64ZT.
EndNote
Karayiğit H, Akdağli A, Acı Çİ (01 October 2024) Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 30 5 610–621.
IEEE
[1] H. Karayiğit, A. Akdağli, and Ç. İ. Acı, “Performance comparison of data balancing techniques on hate speech detection in Turkish”, Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 30, no. 5, pp. 610–621, Oct. 2024, [Online]. Available: https://izlik.org/JA87YW64ZT
ISNAD
Karayiğit, Habibe - Akdağli, Ali - Acı, Çiğdem İnan. “Performance comparison of data balancing techniques on hate speech detection in Turkish”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 30/5 (01 October 2024): 610-621. https://izlik.org/JA87YW64ZT.
JAMA
1. Karayiğit H, Akdağli A, Acı Çİ. Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2024;30:610–621.
MLA
Karayiğit, Habibe, et al. “Performance comparison of data balancing techniques on hate speech detection in Turkish”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 30, no. 5, Oct. 2024, pp. 610-21, https://izlik.org/JA87YW64ZT.
Vancouver
1. Habibe Karayiğit, Ali Akdağli, Çiğdem İnan Acı. Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi [Internet]. 01 October 2024;30(5):610-21. Available from: https://izlik.org/JA87YW64ZT