Performance comparison of data balancing techniques on hate speech detection in Turkish

Habibe Karayiğit; Ali Akdağli; Çiğdem İnan Acı

Abstract

The growing volume of hate speech on social media platforms causes psychological harm and other deep, negative effects. Automatic language classification models are needed to detect hate speech. When language models for hate speech are tested, imbalanced datasets, in which one class is represented far more often than the other, can be a problem. With an imbalanced dataset, the classifier may become biased towards the majority class and perform poorly on the minority class, which can lead to incorrect or unreliable classification results. To address this problem, data-level balancing methods such as oversampling or undersampling are used to balance the class distribution before the dataset is classified. This study aims to obtain a successful classification model combination that detects hate speech by using data-level balancing methods. To this end, a comprehensive study was carried out by applying eight data-level balancing methods (random oversampling, Synthetic Minority Oversampling Technique (SMOTE), K-means SMOTE, Localized Random Affine Shadow Sample (LoRAS), Text-based Generative Adversarial Network (TextGAN), NearMiss, Tomek Links, and Clustering-based) to the Abusive Turkish Comments (ATC) dataset, which was obtained from Instagram and has an imbalanced label distribution. The classification performance of the data-level balancing methods was evaluated with Basic Machine Learning (BML) and Convolutional Neural Network (CNN) methods. The CBoW+CNN model based on the TextGAN data-level balancing method, together with the Skip-gram CNN model, exhibited the best classification performance, with a macro-averaged F1 score of 0.972.
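
As a rough illustration of two ideas named in the abstract, the sketch below shows random oversampling (the simplest of the eight balancing methods listed) and the macro-averaged F1 score used to report results. It is not the authors' implementation: the function names `random_oversample` and `macro_f1`, the toy comments, and the 8-to-2 class split are all hypothetical and chosen only for demonstration.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until
    every class is as large as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap to the majority size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced data: 8 "clean" comments (0) and 2 "abusive" ones (1).
texts = [f"comment {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))

# A classifier that always predicts the majority class looks accurate
# (8 of 10 correct) but scores poorly on macro-averaged F1.
majority_only = [0] * len(labels)
print(round(macro_f1(labels, majority_only), 3))
```

The second printed value shows why a macro-averaged score is informative on imbalanced data: each class contributes equally, so ignoring the minority class is penalised. In practice, balancing would normally be applied only to the training portion of each split so that duplicated samples do not leak into the evaluation data.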

Details

Primary Language

English

Subjects

Algorithms and Theory of Computation, Electrical Engineering (Other)

Section

Research Article

Authors

Habibe Karayiğit
Türkiye

Ali Akdağli *
Türkiye

Çiğdem İnan Acı
Türkiye

Publication Date

October 30, 2024

Submission Date

March 13, 2023

Acceptance Date

October 10, 2023

Published in Issue

Year 2024 Volume: 30 Issue: 5

IZ

https://izlik.org/JA87YW64ZT

Cite

APA

Karayiğit, H., Akdağli, A., & Acı, Ç. İ. (2024). Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, 30(5), 610-621. https://izlik.org/JA87YW64ZT

AMA

1. Karayiğit H, Akdağli A, Acı Çİ. Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2024;30(5):610-621. https://izlik.org/JA87YW64ZT

Chicago

Karayiğit, Habibe, Ali Akdağli, and Çiğdem İnan Acı. 2024. “Performance comparison of data balancing techniques on hate speech detection in Turkish”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 30 (5): 610-21. https://izlik.org/JA87YW64ZT.

EndNote

Karayiğit H, Akdağli A, Acı Çİ (October 1, 2024) Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 30 5 610–621.

IEEE

[1] H. Karayiğit, A. Akdağli, and Ç. İ. Acı, “Performance comparison of data balancing techniques on hate speech detection in Turkish”, Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 30, no. 5, pp. 610–621, Oct. 2024, [Online]. Available: https://izlik.org/JA87YW64ZT

ISNAD

Karayiğit, Habibe - Akdağli, Ali - Acı, Çiğdem İnan. “Performance comparison of data balancing techniques on hate speech detection in Turkish”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 30/5 (October 1, 2024): 610-621. https://izlik.org/JA87YW64ZT.

JAMA

1. Karayiğit H, Akdağli A, Acı Çİ. Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2024;30:610–621.

MLA

Karayiğit, Habibe, et al. “Performance comparison of data balancing techniques on hate speech detection in Turkish”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 30, no. 5, Oct. 2024, pp. 610-21, https://izlik.org/JA87YW64ZT.

Vancouver

1. Habibe Karayiğit, Ali Akdağli, Çiğdem İnan Acı. Performance comparison of data balancing techniques on hate speech detection in Turkish. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi [Internet]. October 1, 2024;30(5):610-21. Available from: https://izlik.org/JA87YW64ZT