Research Article

Hate Speech Classification with Machine Learning and Ensemble Learning

Year 2025, Volume: 8, Issue: 1, 56-65, 23.06.2025

Abstract

Technological developments in recent years have changed how people communicate, and it is
increasingly apparent that unhealthy communication has brought social life to a breaking
point. People are tense and harbor intense negative emotions towards one another, and these
emotions have become visible on social media. Factors such as pandemics and wars aggravate
the problem. In this study, Reddit, Twitter, and 4Chan data were preprocessed with natural
language processing techniques, and the texts were then represented with TF-IDF, Bag of
Words (BoW), and Word2Vec (CBoW and Skip-Gram). These representations were classified as
containing or not containing hate speech using machine learning methods (Decision Tree,
K-Nearest Neighbor, Logistic Regression, Naive Bayes, and Support Vector Machine) and
ensemble learning methods (AdaBoost, Hard Voting, Soft Voting, Stacking, and XGBoost). The
models were evaluated with Accuracy, Precision, Recall, and F1 score on an 80%-20%
train-test split. The best result, 97.20% Accuracy, 97.61% F1, 95.90% Recall, and 99.39%
Precision, was obtained with the Stacking model built from the machine learning algorithms
on Word2Vec CBoW representations. This study shows that Word2Vec, a prediction-based
representation method, gives good results even on imbalanced datasets.
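
The voting schemes and evaluation metrics named in the abstract can be illustrated with a small standard-library sketch. It is not the authors' implementation: the helper names (`hard_vote`, `soft_vote`, `binary_metrics`, `f1_from`) and the toy labels and probabilities are hypothetical and chosen only for illustration. The last lines check that the reported F1 score is consistent with the reported Precision and Recall.

```python
from collections import Counter


def hard_vote(labels):
    """Hard voting: return the class label predicted by most base models."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Soft voting: average per-class probabilities, return the best class index."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=averaged.__getitem__)


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Toy data (assumption): 1 = hate speech, 0 = no hate speech.
y_true = [1, 1, 1, 0, 0]
model_preds = [
    [1, 1, 0, 0, 1],  # hypothetical base model A
    [1, 0, 1, 0, 0],  # hypothetical base model B
    [1, 1, 0, 1, 0],  # hypothetical base model C
]

# Hard voting: majority label per text across the three base models.
ensemble_pred = [hard_vote(votes) for votes in zip(*model_preds)]
scores = binary_metrics(y_true, ensemble_pred)
print(ensemble_pred, scores)

# Soft voting can disagree with hard voting when one model is very confident.
# Each row is [P(class 0), P(class 1)] from one hypothetical model.
probs = [[0.40, 0.60], [0.90, 0.10], [0.45, 0.55]]
soft_label = soft_vote(probs)
hard_label = hard_vote([max((0, 1), key=p.__getitem__) for p in probs])
print(soft_label, hard_label)

# Consistency check on the reported figures (Precision 99.39%, Recall 95.90%).
reported_f1 = round(f1_from(99.39, 95.90), 2)
print(reported_f1)
```

Stacking differs from both voting schemes in that a meta-learner is trained on the base models' outputs instead of combining them with a fixed rule, so it is not covered by this sketch.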

Details

Primary Language Turkish
Subjects Computer Vision, Machine Learning (Other), Natural Language Processing
Section Research Article
Authors

Hüsnü Baran 0009-0005-5054-6284

Muhammet Sinan Başarslan 0000-0002-7996-9169

Submission Date December 22, 2024
Acceptance Date March 7, 2025
Publication Date June 23, 2025
Published in Issue Year 2025, Volume: 8, Issue: 1

Cite

APA Baran, H., & Başarslan, M. S. (2025). Makine Öğrenmesi ve Topluluk Öğrenmesi ile Nefret Söylemi Sınıflandırması. Veri Bilimi, 8(1), 56-65.


