Research Article

Türkçe spam tespiti bağlamında öğrenme tekniklerinin karşılaştırmalı analizi

Year 2024, Volume: 14 Issue: 1, 43 - 56, 07.07.2024
https://doi.org/10.55024/buyasambid.1501609

Abstract

Kısa Mesaj Servisi (SMS), milyarlarca insan tarafından cep telefonu aracılığıyla iletişim kurmak için kullanılan bir mobil mesajlaşma aracıdır. Ancak, uygun mesaj filtreleme tekniklerinin eksikliği nedeniyle, bu iletişim biçimi istenmeyen ve önemsiz mesajlara karşı savunmasızdır. Bu makalede, Adaptif Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), K-En Yakın Komşular (KNN), Karar Ağacı (DT), Rastgele Orman (RF), Multinomial Naïve Bayes (MNB), Lojistik Regresyon (LR) ve Destek Vektör Makineleri (DVM) gibi makine öğrenimi yöntemleri ile Evrişimli Sinir Ağları (CNN), Yapay Sinir Ağları (YSA) ve Uzun Kısa Süreli Bellek (LSTM) gibi derin öğrenme yöntemlerine dayalı SMS spam tespit yaklaşımları f-skor, doğruluk, duyarlılık, kesinlik ve her bir strateji için oluşturulan karışıklık matrisi açısından karşılaştırılmıştır. Çalışma, yöntemleri değerlendirmek için iki farklı ön işleme yöntemini iki farklı Türkçe SMS veri kümesi üzerinde test etmiştir. Bu çalışmanın amacı, Türkiye'deki spam filtreleme konusuna katkıda bulunmaktır. Sonuçlar, kullanılan iki veri kümesinin bir kombinasyonu olan BigTurkishSMS veri kümesi üzerinde en yüksek doğruluk değerlerinin birinci ön işleme yöntemi kullanılarak Destek Vektör Makinesi (%99,03) ve ikinci ön işleme yöntemi kullanılarak Lojistik Regresyon ve Rastgele Orman (%98,07) ile elde edildiğini göstermektedir. Makine öğrenimi algoritmalarının çoğunda olduğu gibi, veri setinin ikinci ön işlemesi derin öğrenme modellerinde üstün sonuçlar vermiştir. YSA modeli %97,41'lik bir skorla en yüksek doğruluğu elde etmiştir. Bu çalışma, Türkçe SMS veri kümeleri üzerinde makine öğrenimi ve derin öğrenme tekniklerinin bir karşılaştırmasını yaparak bu alanda çalışan araştırmacılar için değerli bilgiler sağlamaktadır.

Thanks

Yazar, Çakmak Z. ve Çifçi M.S.'ye veri setlerinin toplanması ve deneylerin gerçekleştirilmesindeki yardımları için teşekkür eder. Bu makale Uluslararası Bilişim Kongresi 2024 (IIC2024)'te sunulmuştur.

References

  • Al Maruf, A., Al Numan, A., Haque, M. M., Jidney, T. T., & Aung, Z. (2023, April). Ensemble approach to classify spam SMS from Bengali text. In International Conference on Advances in Computing and Data Sciences (pp. 440-453). Cham: Springer Nature Switzerland.
  • Almeida, T. A., Hidalgo, J. M. G., & Yamakami, A. (2011, September). Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering (pp. 259-262).
  • Arulprakash, M. (2021). Eshort message service spam detection and filtering using machine learning approach. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 12(9), 721-727.
  • Chen, Y. H., Huang, L., Wang, C. D., Fu, M., Huang, S. Q., Huang, J., Tan & Yan, C. (2022). Adversarial Spam Detector with Character Similarity Network. IEEE Transactions on Industrial Informatics, 19(3), 2541-2551. doi: 10.1109/TII.2022.3177726
  • Dierks, Z. (2023). Forecast of the smartphone user penetration rate in Turkey 2018-2024. Tech. rep., Statista.
  • Ergin, S., & Isik, S. (2014a, June). The assessment of feature selection methods on agglutinative language for spam email detection: A special case for Turkish. In 2014 IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA) Proceedings (pp. 122-125). IEEE.
  • Ergin, S., & Isik, S. (2014b, June). The investigation on the effect of feature vector dimension for spam email detection with a new framework. In 2014 9th Iberian Conference on Information Systems and Technologies (CISTI) (pp. 1-4). IEEE.
  • Eryılmaz, E. E., Şahin, D. Ö., & Kılıç, E. (2020, June). Filtering Turkish spam using LSTM from deep learning techniques. In 2020 8th International Symposium on Digital Forensics and Security (ISDFS) (pp. 1-6). IEEE.
  • Gupta, M., Bakliwal, A., Agarwal, S., & Mehndiratta, P. (2018, August). A comparative study of spam SMS detection using machine learning classifiers. In 2018 eleventh international conference on contemporary computing (IC3) (pp. 1-7). IEEE.
  • Jain, G., Sharma, M., & Agarwal, B. (2019). Optimizing semantic LSTM for spam detection. International Journal of Information Technology, 11, 239-250. doi: 10.1007/s41870-018-0157-5
  • Karamollaoglu, H., Dogru, İ. A., & Dorterler, M. (2018, October). Detection of Spam E-mails with Machine Learning Methods. In 2018 Innovations in Intelligent Systems and Applications Conference (ASYU) (pp. 1-5). IEEE.
  • Karasoy, O., & Ballı, S. (2022). Spam SMS detection for Turkish language with deep text analysis and deep learning methods. Arabian Journal for Science and Engineering, 47(8), 9361-9377. doi: 10.1007/s13369-021-06187-1
  • Kaya, Y., & Ertuğrul, Ö. F. (2016). A novel feature extraction approach in SMS spam filtering for mobile communication: one‐dimensional ternary patterns. Security and communication networks, 9(17), 4680-4690. doi: 10.1002/sec.1660
  • Kemp, S. (2024). Digital 2024 global overview report. Tech. rep., Meltwater and We Are Social.
  • Masum, E., & Samet, R. (2018). Mobil BOTNET ile DDOS Saldırısı. Bilişim Teknolojileri Dergisi, 11(2), 111-121. doi: 10.17671/gazibtd.306612
  • Mathew, K., & Issac, B. (2011, December). Intelligent spam classification for mobile text message. In Proceedings of 2011 International Conference on Computer Science and Network Technology (Vol. 1, pp. 101-105). IEEE.
  • Matthew Shanahan, K.B. (2023). The state of mobile internet connectivity report. Tech. rep., GSMA Intelligence.
  • Örnek, Ö. (2019). Orange 3 ile Türkçe ve İngilizce SMS Mesajlarında Spam Tespiti. Eskişehir Türk Dünyası Uygulama ve Araştırma Merkezi Bilişim Dergisi, 1(1), 1-4.
  • Özdemir, C., Ataş, M., & Özer, A. B. (2013, April). Classification of Turkish spam e-mails with artificial immune system. In 2013 21st Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
  • Roy, P. K., Singh, J. P., & Banerjee, S. (2020). Deep learning to filter SMS Spam. Future Generation Computer Systems, 102, 524-533. doi: 10.1016/j.future.2019.09.001
  • Sajedi, H., Parast, G. Z., & Akbari, F. (2016). SMS spam filtering using machine learning techniques: A survey. Machine Learning Research, 1(1), 1-14. doi: 10.11648/j.mlr.20160101.11
  • Salman, M., Ikram, M., & Kaafar, M. A. (2024). Investigating Evasive Techniques in SMS Spam Filtering: A Comparative Analysis of Machine Learning Models. IEEE Access, 12, 24306-24324. doi: 10.1109/ACCESS.2024.3364671
  • Suleiman, D., Al-Naymat, G., & Itriq, M. (2020). Deep SMS Spam Detection using H2O Platform. International Journal of Advanced Trends in Computer Science and Engineering, 9(5), 9179-9188. doi: 10.30534/ijatcse/2020/326952020
  • Theodorus, A., Prasetyo, T. K., Hartono, R., & Suhartono, D. (2021, April). Short message service (SMS) spam filtering using machine learning in Bahasa Indonesia. In 2021 3rd East Indonesia Conference on Computer and Information Technology (EIConCIT) (pp. 199-203). IEEE.
  • United Nations Department of Economic and Social Affairs, Population Division (2022). World Population Prospects 2022: Summary of Results. UN DESA/POP/2022/TR/NO. 3.
  • Uysal, A. K., Gunal, S., Ergin, S., & Gunal, E. S. (2013). The impact of feature extraction and selection on SMS spam filtering. Elektronika ir Elektrotechnika, 19(5), 67-72. doi: 10.5755/j01.eee.19.5.1829

A comparative analysis of learning techniques in the context of Turkish spam detection

Abstract

Short Message Service (SMS) is a mobile messaging tool that billions of people use to communicate by mobile phone. However, without adequate message filtering, this form of communication is vulnerable to unwanted and junk messages. This paper compares SMS spam detection approaches based on machine learning methods such as Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), Multinomial Naïve Bayes (MNB), Logistic Regression (LR), and Support Vector Machines (SVM), and on deep learning methods such as Convolutional Neural Networks (CNN), Artificial Neural Networks (ANN), and Long Short-Term Memory (LSTM). The approaches are compared in terms of F-score, accuracy, recall, precision, and a confusion matrix constructed for each approach. To evaluate them, the study tested two different preprocessing methods on two different Turkish SMS datasets. The aim of this study is to contribute to spam filtering in Turkey. On the BigTurkishSMS dataset, a combination of the two datasets used, the highest accuracy values were achieved by Support Vector Machine (99.03%) with the first preprocessing method and by Logistic Regression and Random Forest (98.07%) with the second preprocessing method. As with the majority of the machine learning algorithms, the second preprocessing method yielded better results for the deep learning models. Among the deep learning models, the ANN model achieved the highest accuracy, 97.41%. By comparing machine learning and deep learning techniques on Turkish SMS datasets, the study provides useful insights for researchers working in this field.
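
The evaluation measures named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch in Python, not the author's code: it assumes spam is the positive class, that the F-score is the balanced F1 variant (the abstract does not say which variant was used), and that the labels "spam" and "ham" and the toy predictions are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix with spam as the (assumed) positive class."""
    tp: int  # spam messages predicted as spam
    fp: int  # legitimate messages predicted as spam
    fn: int  # spam messages predicted as legitimate
    tn: int  # legitimate messages predicted as legitimate


def confusion_matrix(y_true, y_pred, positive="spam"):
    """Count the four outcomes for paired true and predicted labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionMatrix(tp, fp, fn, tn)


def scores(cm):
    """Return accuracy, precision, recall and F1; undefined ratios become 0.0."""
    total = cm.tp + cm.fp + cm.fn + cm.tn
    accuracy = (cm.tp + cm.tn) / total if total else 0.0
    precision = cm.tp / (cm.tp + cm.fp) if (cm.tp + cm.fp) else 0.0
    recall = cm.tp / (cm.tp + cm.fn) if (cm.tp + cm.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical labels for eight messages, for illustration only.
    y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
    y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
    cm = confusion_matrix(y_true, y_pred)
    print(cm)
    print(scores(cm))
```

Accuracy alone can look high on an imbalanced SMS corpus even when many spam messages are missed, which is one reason to read it alongside precision, recall, and the confusion matrix itself.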

Thanks

The author would like to thank Cakmak Z. and Cifci M.S. for their assistance in gathering datasets and conducting experiments. This paper was presented at the International Information Congress 2024 (IIC2024).

There are 26 citations in total.

Details

Primary Language: English
Subjects: Applied Computing (Other)
Journal Section: Research Article
Authors: Öznur Şengel (ORCID 0000-0002-2186-927X)

Publication Date: July 7, 2024
Submission Date: June 15, 2024
Acceptance Date: June 25, 2024
Published in Issue: Year 2024, Volume: 14, Issue: 1

Cite

APA Şengel, Ö. (2024). A comparative analysis of learning techniques in the context of Turkish spam detection. Batman Üniversitesi Yaşam Bilimleri Dergisi, 14(1), 43-56. https://doi.org/10.55024/buyasambid.1501609