Analysis of Machine Learning Methods and Language Models for Spam Detection in Turkish Emails

Zekeriya Anıl Güven

doi:10.31590/ejosat.1234079


Abstract

With the recent growth of technology and social networks, online interaction and the sharing of opinions on any topic have become increasingly important. These interactions have benefits, but they also carry considerable risks. Obtaining users' information and impersonating them on social networks is a serious security problem, because it enables fraud and similar abuse carried out in the users' names. The most common way to impersonate users is to send spam messages and emails. Countermeasures include spam filtering and the development of spam detection methods. This study analyzes four machine learning methods (Random Forest, Logistic Regression, Naive Bayes, and Artificial Neural Networks) and four language models (BERT, ELECTRA, ALBERT, and DistilBERT) for detecting spam in Turkish emails, with the aim of showing how language models affect spam email classification for Turkish. In the experiments, every language model outperformed the machine learning methods at classifying spam emails. Among the machine learning methods, artificial neural networks reached 90.15% accuracy, while the most successful language models, BERT and ELECTRA, reached 94.08% accuracy.
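As an illustration of one of the machine learning baselines named in the abstract and of the accuracy metric it reports, the following is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier in plain Python. The class name `MultinomialNaiveBayes`, the whitespace tokenizer, the Laplace smoothing, and the six toy training emails are all assumptions made for this example; the study's actual dataset, features, and preprocessing are not described on this page.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. A real Turkish pipeline would
    # likely need stemming or subword tokenization for agglutinative forms.
    return text.lower().split()


class MultinomialNaiveBayes:
    """Hypothetical minimal classifier, not the implementation used in the study."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of all tokens seen in training.
        self.vocab = set().union(*self.word_counts.values())
        self.totals = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        return self

    def predict_one(self, text):
        n_docs = sum(self.class_counts.values())
        # Tokens never seen in training are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label in sorted(self.class_counts):
            # Log prior plus Laplace-smoothed log likelihood of each token.
            score = math.log(self.class_counts[label] / n_docs)
            for tok in tokens:
                num = self.word_counts[label][tok] + 1
                den = self.totals[label] + len(self.vocab)
                score += math.log(num / den)
            scores[label] = score
        return max(scores, key=scores.get)

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data invented for this sketch; it is not taken from the paper.
train_texts = [
    "hemen tıkla büyük ödül kazan",
    "bedava hediye kazanmak için tıkla",
    "kredi kartı bilgilerini gönder ödül kazan",
    "toplantı yarın saat onda",
    "rapor ekte bilgilerinize sunarım",
    "yarın öğle yemeği için buluşalım",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["bedava ödül için hemen tıkla", "yarın toplantı raporu ekte"]
test_labels = ["spam", "ham"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
predictions = model.predict(test_texts)
print(predictions, accuracy(test_labels, predictions))
```

The reported figures (90.15% and 94.08%) are accuracies in this sense: correct predictions divided by the number of test emails. The toy example says nothing about how the methods rank on real data.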



Details

Primary Language

Turkish

Subjects

Engineering

Section

Research Article

Authors

Zekeriya Anıl Güven *
ORCID: 0000-0002-7025-2815
Türkiye

Publication Date

January 31, 2023

Submission Date

January 14, 2023

Acceptance Date

January 25, 2023

Published in Issue

Year 2023, Issue 47

DOI

https://doi.org/10.31590/ejosat.1234079

IZ

https://izlik.org/JA46FR58CF

Cite

APA

Güven, Z. A. (2023). Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi. Avrupa Bilim ve Teknoloji Dergisi, 47, 1-6. https://doi.org/10.31590/ejosat.1234079

AMA

1. Güven ZA. Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi. EJOSAT. 2023;(47):1-6. doi:10.31590/ejosat.1234079

Chicago

Güven, Zekeriya Anıl. 2023. “Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi”. Avrupa Bilim ve Teknoloji Dergisi, no. 47: 1-6. https://doi.org/10.31590/ejosat.1234079.

EndNote

Güven ZA (1 January 2023) Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi. Avrupa Bilim ve Teknoloji Dergisi 47 1–6.

IEEE

[1] Z. A. Güven, “Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi”, EJOSAT, no. 47, pp. 1–6, Jan. 2023, doi: 10.31590/ejosat.1234079.

ISNAD

Güven, Zekeriya Anıl. “Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi”. Avrupa Bilim ve Teknoloji Dergisi. 47 (1 January 2023): 1-6. https://doi.org/10.31590/ejosat.1234079.

JAMA

1. Güven ZA. Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi. EJOSAT. 2023;(47):1–6.

MLA

Güven, Zekeriya Anıl. “Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi”. Avrupa Bilim ve Teknoloji Dergisi, no. 47, January 2023, pp. 1-6, doi:10.31590/ejosat.1234079.

Vancouver

1. Zekeriya Anıl Güven. Türkçe E-postalarda Spam Tespiti için Makine Öğrenme Yöntemlerinin ve Dil Modellerinin Analizi. EJOSAT. 1 January 2023;(47):1-6. doi:10.31590/ejosat.1234079


Cited By

Sağlık Kuruluşlarının Kurumsal İtibarının Metin Madenciliği ve Duygu Analizi ile Değerlendirilmesi

Artificial Intelligence-Based Automation of the Referral Process for Applications Submitted to CİMER

E-POSTA DOLANDIRICILIĞININ TESPİTİ İÇİN HİBRİT NAİVE BAYES VE DERİN ÖĞRENME YAKLAŞIMI