Offensive Language Detection in Turkish Language by Using NLP

Bekir Furkan Kesgin; Rüştü Murat Demirer

doi:10.16984/saufenbilder.1349956

Research Article

Year 2025, Volume: 29 Issue: 1, 1 - 17

Abstract

The growing use of social media has increased online harassment, cyberhate, and the use of offensive language, all of which are difficult to detect and address effectively. Although Natural Language Processing (NLP) has advanced considerably, automatically identifying offensive language remains a complex task because user-generated content is ambiguous and informal and depends on the social context in which it occurs. In this thesis, our goal is to develop methods for the automatic detection of offensive language in social media. Multiple classification algorithms, including Multinomial Naive Bayes, Gaussian Naive Bayes, SVM, Logistic Regression, and LSTM, are implemented and evaluated using accuracy, F1 score, and AUC score. Results show that the Random Forest Classifier obtains an AUC score of 0.65 and an accuracy of 0.82 without word2vec, whereas LSTM achieves a higher AUC score of 0.78. These findings provide insight into the effectiveness of different algorithms for offensive language detection. The research contributes tools and insights for Turkish language processing and for online safety, particularly in combating cyberbullying and fostering a tolerant online environment. The findings also pave the way for future research in natural language processing and have practical implications for protecting individuals and promoting a secure online space.
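The three evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the labels, the scores, and the 0.5 decision threshold below are all hypothetical, chosen only to show how accuracy, F1, and AUC are computed for a binary offensive / not-offensive classifier.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = offensive, 0 = not offensive (imbalanced on purpose).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical classifier scores (higher = more likely offensive).
scores = [0.10, 0.20, 0.15, 0.30, 0.05, 0.40, 0.60, 0.35, 0.45, 0.70]
# Hard predictions from an assumed decision threshold of 0.5.
y_pred = [int(s >= 0.5) for s in scores]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"F1       = {f1_score(y_true, y_pred):.2f}")
print(f"AUC      = {roc_auc(y_true, scores):.2f}")
```

Accuracy and F1 depend on the chosen threshold, while AUC depends only on how the scores rank positives against negatives. On imbalanced data the measures can therefore differ noticeably, which is one reason to report them together.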

Keywords

Cyberhate, Social media, Natural language processing, Classification algorithms, Cyberbullying

Turkish title: NLP Kullanarak Türkçede Saldırgan Dil Tespiti

There are 20 citations in total.

Details

Primary Language	English
Subjects	Machine Learning (Other), Cybersecurity and Privacy (Other)
Journal Section	Research Articles
Authors	Bekir Furkan Kesgin (ORCID: 0009-0002-7875-637X), Rüştü Murat Demirer (ORCID: 0000-0002-5508-741X)
Early Pub Date	February 12, 2025
Publication Date
Submission Date	August 25, 2023
Acceptance Date	January 12, 2025
Published in Issue	Year 2025 Volume: 29 Issue: 1

Cite

APA	Kesgin, B. F., & Demirer, R. M. (2025). Offensive Language Detection in Turkish Language by Using NLP. Sakarya University Journal of Science, 29(1), 1-17. https://doi.org/10.16984/saufenbilder.1349956
AMA	Kesgin BF, Demirer RM. Offensive Language Detection in Turkish Language by Using NLP. SAUJS. February 2025;29(1):1-17. doi:10.16984/saufenbilder.1349956
Chicago	Kesgin, Bekir Furkan, and Rüştü Murat Demirer. “Offensive Language Detection in Turkish Language by Using NLP”. Sakarya University Journal of Science 29, no. 1 (February 2025): 1-17. https://doi.org/10.16984/saufenbilder.1349956.
EndNote	Kesgin BF, Demirer RM (February 1, 2025) Offensive Language Detection in Turkish Language by Using NLP. Sakarya University Journal of Science 29 1 1–17.
IEEE	B. F. Kesgin and R. M. Demirer, “Offensive Language Detection in Turkish Language by Using NLP”, SAUJS, vol. 29, no. 1, pp. 1–17, 2025, doi: 10.16984/saufenbilder.1349956.
ISNAD	Kesgin, Bekir Furkan - Demirer, Rüştü Murat. “Offensive Language Detection in Turkish Language by Using NLP”. Sakarya University Journal of Science 29/1 (February 2025), 1-17. https://doi.org/10.16984/saufenbilder.1349956.
JAMA	Kesgin BF, Demirer RM. Offensive Language Detection in Turkish Language by Using NLP. SAUJS. 2025;29:1–17.
MLA	Kesgin, Bekir Furkan and Rüştü Murat Demirer. “Offensive Language Detection in Turkish Language by Using NLP”. Sakarya University Journal of Science, vol. 29, no. 1, 2025, pp. 1-17, doi:10.16984/saufenbilder.1349956.
Vancouver	Kesgin BF, Demirer RM. Offensive Language Detection in Turkish Language by Using NLP. SAUJS. 2025;29(1):1-17.