TY - JOUR
T1 - Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-based Techniques including Individual, Ensemble and Hybrid Approaches
TT - Türkçe ve İngilizce Dillerinde Spam Posta Tespiti: Bireysel, Toplu ve Hibrit Yaklaşımları İçeren Yapay Zeka Tabanlı Tekniklerin Bütünsel Bir Çalışması
AU - Eroğlu, Alperen
AU - Candan, Esma Nisa
AU - Küçükilhan, Rehnüma
PY - 2025
DA - August
Y2 - 2024
JF - Necmettin Erbakan Üniversitesi Fen ve Mühendislik Bilimleri Dergisi
JO - NEU Fen Muh Bil Der
PB - Necmettin Erbakan University
WT - DergiPark
SN - 2667-7989
SP - 189
EP - 205
VL - 7
IS - 2
LA - en
AB - Spam has surged with the increased use of email and social media, making it a critical challenge to detect and classify this growing volume effectively without causing harm to systems. This paper presents a holistic strategy to analyze and identify the most efficient approaches for detecting and classifying e-mails as spam or ham using Turkish and English datasets. We use two datasets in different languages, in addition to newly generated combined datasets. We conduct a comparative study to identify the best spam mail detection approaches based on our enhanced machine learning and deep learning methods. We also bring ensemble and hybrid learning methods together as a new approach for spam mail detection. We utilize natural language processing and improved learning algorithms with optimized feature selection approaches and preprocessing. We compare various methods commonly used in the literature: Multinomial Naive Bayes, Support Vector Machine, Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Voting classifier, and Stacking classifier as machine learning algorithms, and Long Short-Term Memory, Bidirectional Long Short-Term Memory, and Bidirectional Encoder Representations from Transformers as deep learning algorithms. We split the datasets into training and test data at an 80:20 ratio, in addition to 5-fold cross-validation for each model. We also optimize the hyperparameters of our models using Grid Search. The ensemble method based on machine learning approaches provides the best performance, 99.9%, for the English Enron dataset, and the hybrid ensemble approach based on simple averaging yields the best accuracy, 98.43%, for the Turkish dataset from UCI and Kaggle.
KW - English Datasets
KW - Ensemble Learning
KW - Hybrid Learning
KW - Turkish Datasets
KW - Spam Mail
N2 - Artan e-posta ve sosyal medya kullanımı nedeniyle spam sayısı artmış ve bu durumun sistemlere zarar vermeden etkili bir şekilde tespit edilmesi ve sınıflandırılması konusunda kritik bir zorluk oluşturmuştur. Bu makale, Türkçe ve İngilizce veri kümelerini kullanarak e-postaları spam veya ham olarak tespit etmek ve sınıflandırmak için en etkili yaklaşımları analiz etmek ve ortaya çıkarmak için bütünsel bir strateji sunmaktadır. Birleşik olarak oluşturulan yeni veri kümelerine ek olarak, farklı dillerde oluşturulan iki farklı veri kümesi kullanılmaktadır. Gelişmiş makine öğrenmesi ve derin öğrenme yaklaşımlarını temel alarak en iyi spam posta algılama yöntemlerini sunmak için karşılaştırmalı bir çalışma yapılmaktadır. Ayrıca yeni bir yaklaşım olarak spam posta tespiti için toplu ve hibrit öğrenme yöntemleri bir araya getirilmiştir. Optimize edilmiş özellik seçimi yaklaşımları ve ön işleme ile doğal dil işlemeyi ve geliştirilmiş öğrenme algoritmaları kullanılmaktadır. Literatürde yaygın olarak kullanılan Multinomial Naive Bayes, Destek Vektör Makinesi, Lojistik Regresyon, K-En Yakın Komşular, Karar Ağacı, Rastgele Orman, Oylama sınıflandırıcısı ve makine öğrenme algoritmaları olarak Yığınlama Sınıflandırıcısı ile Uzun Kısa Süreli Bellek, Çift Yönlü yöntemlerini karşılaştırmaktayız. Uzun Kısa Süreli Bellek, Transformatörlerden Çift Yönlü Kodlayıcı Gösterimleri ise derin öğrenme algoritmaları olarak kullanılmaktadır. 5 kat çapraz doğrulamaya ek olarak, veri kümeleri her model için 80:20 oranlarıyla eğitim verileri ve test verileri olarak bölünmüştür. Izgara Arama tekniği kullanılarak modellerin hiper parametreleri de optimize edilmektedir. Makine öğrenmesi yaklaşımlarına dayalı toplu öğrenme yöntemi, İngilizce Enron veri seti için %99,9 ile en iyi performansı sağlarken, basit ortalamaya dayalı hibrit toplu öğrenme yaklaşımı, UCI ve Kaggle'dan Türkçe veri seti için %98,43 ile en iyi doğruluk değerini vermektedir.
CR - J. Doshi, K. Parmar, R. Sanghavi, N. Shekokar, A comprehensive dual-layer architecture for phishing and spam email detection, Computers & Security. 133 (2023), 103378. doi:10.1016/j.cose.2023.103378
CR - N. Saidani, K. Adi, MS. Allili, A semantic-based classification approach for an enhanced spam detection, Computers & Security. 94 (2020), 101716. doi:10.1016/j.cose.2020.101716
CR - B. Feng, Q. Fu, M. Dong, D. Guo, Q. Li, Multistage and elastic spam detection in mobile social networks through deep learning, IEEE Network. 32(4) (2018), 15-21. doi:10.1109/MNET.2018.1700406
CR - A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti, M. Alazab, A comprehensive survey for intelligent spam email detection, IEEE Access. 7 (2019), 168261-168295. doi:10.1109/ACCESS.2019.2954791
CR - S. Gibson, B. Issac, L. Zhang, SM. Jacob, Detecting spam email with machine learning optimized with bio-inspired metaheuristic algorithms, IEEE Access. 8 (2020), 187914-187932. doi:10.1109/ACCESS.2020.3030751
CR - S. Rapacz, P. Chołda, M. Natkaniec, A method for fast selection of machine-learning classifiers for spam filtering, Electronics. 10(17) (2021), 2083. doi:10.3390/electronics10172083
CR - S. Magdy, Y. Abouelseoud, M. Mikhail, Efficient spam and phishing email filtering based on deep learning, Computer Networks. 206 (2022), 108826. doi:10.1016/j.comnet.2022.108826
CR - F. Ozen, R. Ortac Kabaoglu, T. V. Mumcu, Deep Learning Based Temperature and Humidity Prediction, Necmettin Erbakan University Journal of Science and Engineering. 5(2) (2023), 219-229. doi:10.47112/neufmbd.2023.20
CR - M. Hacıbeyoglu, M. Çelik, Ö. Erdaş Çiçek, Energy Efficiency Estimation in Buildings with K Nearest Neighbor Algorithm, Necmettin Erbakan University Journal of Science and Engineering. 5(2) (2023), 65-74. doi:10.47112/neufmbd.2023.10
CR - A. Pektaş, O. İnan, Application of Tree Seed Algorithm on Clustering Problems, Necmettin Erbakan University Journal of Science and Engineering. 4(1) (2022), 1-10. doi:10.47112/neufmbd.2022.8
CR - F. Rustam, N. Saher, A. Mehmood, E. Lee, S. Washington, I. Ashraf, Detecting ham and spam emails using feature union and supervised machine learning models, Multimedia Tools and Applications. 82 (2023), 26545–26561. doi:10.1007/s11042-023-14814-2
CR - E. E. Eryılmaz, D. Ö. Şahin, E. Kılıç, Türkçe İstenmeyen E-postaların Farklı Öznitelik Seçim Yöntemleri Kullanılarak Makine Öğrenmesi Algoritmaları ile Tespit Edilmesi, Türkiye Bilişim Vakfı-Bilgisayar Bilimleri ve Mühendisliği Dergisi. 13(2) (2020), 57-77.
CR - M. A. Shaaban, Y. F. Hassan, S. K. Guirguis, Deep convolutional forest: a dynamic deep ensemble approach for spam detection in text, Complex & Intelligent Systems. 8(6) (2022), 4897-4909. doi:10.1007/s40747-022-00741-6
CR - S. Kaddoura, O. Alfandi, N. Dahmani, A spam email detection mechanism for English language text emails using a deep learning approach, In: 29th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), IEEE, Bayonne, France, 2020: 193-198. doi:10.1109/WETICE49692.2020.00045
CR - T. Toma, S. Hassan, M. Arifuzzaman, An analysis of supervised machine learning algorithms for spam email detection, In: International Conference on Automation, Control, and Mechatronics for Industry 4.0 (ACMI), IEEE, Rajshahi, Bangladesh, 2021: 1-5. doi:10.1109/ACMI53878.2021.9528108
CR - C. M. Shaik, N. M. Penumaka, S. K. Abbireddy, V. Kumar, S. Aravinth, Bi-LSTM and conventional classifiers for email spam filtering, In: Third International Conference on Artificial Intelligence and Smart Energy (ICAIS), IEEE, Coimbatore, India, 2023: 1350-1355. doi:10.1109/ICAIS56108.2023.10073776
CR - K. Debnath, N. Kar, Email spam detection using deep learning approach, In: International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COM-IT-CON), IEEE, Faridabad, India, 2022: 37-41. doi:10.1109/COM-IT-CON54601.2022.9850588
CR - A. R. Yeruva, D. Kamboj, P. Shankar, U. S. Aswal, A. K. Rao, C. Somu, E-mail spam detection using machine learning – KNN, In: 5th International Conference on Contemporary Computing and Informatics (IC3I), IEEE, Uttar Pradesh, India, 2022: 1024-1028. doi:10.1109/IC3I56241.2022.10072628
CR - N. Ahmed, R. Amin, H. Aldabbas, D. Koundal, B. Alouffi, T. Shah, Machine learning techniques for spam detection in email and IoT platforms: Analysis and research challenges, Security and Communication Networks. (1) (2022), 1-19. doi:10.1155/2024/7538203
CR - Z. B. Siddique, M. A. Khan, I. U. Din, A. Almogren, I. Mohiuddin, S. Nazir, Machine learning-based detection of spam emails, Scientific Programming. (1) (2021), 1-11. doi:10.1155/2021/6508784
CR - Z. Hassani, V. Hajihashemi, K. Borna, I. S. Dehmajnoonie, A classification method for E-mail spam using a hybrid approach for feature selection optimization, Journal of Sciences, Islamic Republic of Iran. 31(2) (2020), 165-173.
CR - A. Sheneamer, Comparison of deep and traditional learning methods for email spam filtering, International Journal of Advanced Computer Science and Applications (IJACSA). 12(1) (2021), 560-565. doi:10.14569/IJACSA.2021.0120164
CR - S. Zavrak, S. Yilmaz, Email spam detection using hierarchical attention hybrid deep learning method, Expert Systems with Applications. 233 (2023), 120977. doi:10.1016/j.eswa.2023.120977
CR - G. Hnini, J. Riffi, M. A. Mahraz, A. Yahyaouy, H. Tairi, MMPC-RF: a deep multimodal feature-level fusion architecture for hybrid spam E-mail detection, Applied Sciences. 11(24) (2021), 11968. doi:10.3390/app112411968
CR - A. I. Taloba, S. S. Ismail, An intelligent hybrid technique of decision tree and genetic algorithm for e-mail spam detection, In: IEEE 2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS), IEEE, Cairo, Egypt, 2019: 99-104. doi:10.1109/ICICIS46948.2019.9014756
CR - K. Meena, S. R. Upadhyaya, A Privacy-Preserving machine learning ensemble for spam detection, In: IEEE 5th International Conference on Inventive Research in Computing Applications (ICIRCA), IEEE, Coimbatore, India, 2023: 255-259.
CR - M. Sumathi, S. Raja, Machine learning algorithm-based spam detection in social networks, Social Network Analysis and Mining. 13(1) (2023), 104. doi:10.1007/s13278-023-01108-6
CR - M. Hina, M. Ali, A. R. Javed, F. Ghabban, L. A. Khan, Z. Jalil, SeFACED: Semantic-based forensic analysis and classification of e-mail data using deep learning, IEEE Access. 9 (2021), 98398-98411. doi:10.1109/ACCESS.2021.3095730
CR - S. Xu, Y. Li, W. Zheng, Bayesian multinomial naïve bayes classifier to text classification, In: International Conference on Multimedia and Ubiquitous Engineering, Springer, Singapore, 2017: 347–352. doi:10.1007/978-981-10-5041-1_57
CR - R. O. Olanrewaju, S. A. Olanrewaju, L. A. Nafiu, Multinomial naïve bayes classifier: bayesian versus nonparametric classifier approach, European Journal of Statistics. 2(8) (2022), 1-13. doi:10.28924/ada/stat.2.8
CR - U. K. B. Saravanan, M. Vijay, T. Shreedhar, G. Rajasekar, R. Yashwanth, P. Shakthipriya, Multinomial Naive Bayes Based Machine Learning Analysis of Twitter Sentiment, In: IEEE 2nd International Conference on Edge Computing and Applications (ICECAA), Namakkal, India, 2023: 429-434. doi:10.1109/ICECAA58104.2023.10212150
CR - Y. K. Zamil, S. A. Ali, M. A. Naser, Spam image email filtering using k-nn and svm, International Journal of Electrical and Computer Engineering. 9(1) (2019), 245-254. doi:10.11591/ijece.v9i1. 245-254.
CR - B. Trstenjak, S. Mikac, D. Donko, Knn with tf-idf based framework for text categorization, Procedia Engineering. 69 (2014), 1356-1364. doi:10.1016/j.proeng.2014.03.129
CR - Z. Yong, L. Youwen, X. Shixiong, An improved knn text classification algorithm based on clustering, Journal of Computers. 4(3) (2009), 230-237. doi:10.4304/jcp.4.3.230-237
CR - S. S. Ismail, R. F. Mansour, A. El-Aziz, M. Rasha, A. I. Taloba, Efficient e-mail spam detection strategy using genetic decision tree processing with NLP features, Computational Intelligence and Neuroscience. (2022), 1-16. doi:10.1155/2022/7710005
CR - D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, (2024). https://web.stanford.edu/~jurafsky/slp3/5.pdf (accessed 21 September 2024).
CR - S. Jamshidi, M. Mohammadi, S. Bagheri, H. E. Najafabadi, A. Rezvanian, M. Gheisari, M. Ghaderzadeh, A. S. Shahabi, Z. Wu, Effective text classification using BERT, MTM LSTM, and DT, Data & Knowledge Engineering. 151 (2024), 102306. doi:10.1016/j.datak.2024.102306
CR - Y. Yu, X. Si, C. Hu, J. Zhang, A review of recurrent neural networks: LSTM cells and network architectures, Neural Computation. 31(7) (2019), 1235-1270. doi:10.1162/neco_a_01199
CR - A. Purwarianti, I. A. P. A. Crisdayanti, Improving Bi-LSTM performance for Indonesian sentiment analysis using paragraph vector, In: IEEE 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), Bandung, Indonesia, 2019: 1-5. doi:10.1109/ICAICTA.2019.8904199
CR - Y. Xiong, N. Wei, K. Qiao, Z. Li, Z. Li, Exploring Consumption Intent in Live E-Commerce Barrage: A Text Feature-Based Approach Using BERT-BiLSTM Model, IEEE Access. 12 (2024), 69288-69298. doi:10.1109/ACCESS.2024.3399095
CR - J. Wallat, F. Beringer, A. Anand, V. Anand, Probing BERT for Ranking Abilities, In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, Springer, Cham, 2024: 13981. doi:10.1007/978-3-031-28238-6_17
CR - B. Aytan, C. O. Sakar, Comparison of transformer-based models trained in turkish and different languages on turkish natural language processing problems, In: 2022 30th Signal Processing and Communications Applications Conference (SIU), Safranbolu, Turkey, 2022: 1-4. doi:10.1109/SIU55565.2022.9864818
CR - E. Corp, W. W. Cohen, Enron Email Dataset, (2015). https://www.loc.gov/item/2018487913/ (accessed 23 September 2024).
CR - H. Simsek, E. Aydemir, Classification of unwanted e-mails (spam) with turkish text by different algorithms in weka program, Journal of Soft Computing and Artificial Intelligence. 3 (2022), 1-4. doi:10.55195/jscai.1104694
CR - UCI Machine Learning Repository, Turkish Spam V01 dataset, (2019). https://archive.ics.uci.edu/dataset/530/turkish+spam+v01 (accessed 15 December 2023).
CR - W. Qader, M. Ameen, B. Ahmed, An overview of bag of words; importance, implementation, applications, and challenges, In: 2019 International Engineering Conference (IEC), Erbil, Iraq, 2019: 200-204. doi:10.1109/IEC47844.2019.8950616
CR - L. Havrlant, V. Kreinovich, A simple probabilistic explanation of term frequency-inverse document frequency (tf-idf) heuristic (and variations motivated by this explanation), International Journal of General Systems. 46 (2017), 27-36. doi:10.1080/03081079.2017.1291635
CR - A. Jalilifard, V. F. Carida, A. F. Mansano, R. S. Cristo, F. P. C. Fonseca, Semantic sensitive tf-idf to determine word relevance in documents, In: Advances in Computing and Network Communications, 2021: 327–337. doi:10.1007/978-981-33-6987-0
CR - F. Zhang, W. Song, Product improvement in a big data environment: A novel method based on text mining and large group decision making, Expert Systems with Applications. 245 (2024), 123015. doi:10.1016/j.eswa.2023.123015
CR - J. Pennington, R. Socher, C. Manning, Glove: global vectors for word representation, In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014: 1532–1543. doi:10.3115/v1/D14-1162
CR - Z. Hua, Y. Tong, Y. Zheng, Y. Li, Y. Zhang, PPGlove: Privacy-Preserving Glove for Training Word Vectors in the Dark, IEEE Transactions on Information Forensics and Security. 19 (2024), 3644-3658. doi:10.1109/TIFS.2024.3364080
CR - P. Bountakas, C. Xenakis, HELPHED: Hybrid ensemble learning phishing email detection, Journal of Network and Computer Applications. 210 (2023), 103545. doi:10.1016/j.jnca.2022.103545
CR - O. Sagi, L. Rokach, Ensemble learning: a survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 8(4) (2018), e1249. doi:10.1002/widm.1249
CR - G. Wang, J. Sun, J. Ma, K. Xu, J. Gu, Sentiment classification: the contribution of ensemble learning, Decision Support Systems. 57 (2014), 77-93. doi:10.1016/j.dss.2013.08.002
CR - M. A. Ganaie, M. Hu, A. K. Malik, M. Tanveer, P. N. Suganthan, Ensemble deep learning: a review, Engineering Applications of Artificial Intelligence. 115 (2022), 105151. doi:10.1016/j.engappai.2022.105151
CR - N. C. Yang, K. L. Sung, Non-intrusive load classification and recognition using soft-voting ensemble learning algorithm with decision tree, k-nearest neighbor algorithm and multilayer perceptron, IEEE Access. 11 (2023), 94506-94520. doi:10.1109/ACCESS.2023.3311641
CR - A. Ghourabi, M. Alohaly, Enhancing spam message classification and detection using transformer-based embedding and ensemble learning, Sensors. 23(8) (2023), 3861. doi:10.3390/s23083861
UR - https://dergipark.org.tr/en/pub/neufmbd/issue//1555903
L1 - https://dergipark.org.tr/en/download/article-file/4240166
ER -