Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-based Techniques including Individual, Ensemble and Hybrid Approaches

Esma Nisa Candan; Rehnüma Küçükilhan; Alperen Eroğlu

Research Article

Year 2025, Volume: 7 Issue: 2, 189 - 205, 31.08.2025

Abstract

Spam volume has surged with the growing use of email and social media, making it a critical challenge to detect and classify these messages effectively without harming systems. This paper presents a holistic strategy for identifying the most efficient approaches to classifying e-mails as spam or ham, using Turkish and English datasets. We use two datasets in different languages, together with new combined datasets generated from them. We carry out a comparative study to determine the best spam mail detection approaches based on our enhanced machine learning and deep learning methods. We also bring ensemble and hybrid learning methods together as a new approach for spam mail detection. We use natural language processing and improved learning algorithms with optimized feature selection and preprocessing. We compare methods commonly used in the literature: Multinomial Naive Bayes, Support Vector Machine, Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Voting classifier, and Stacking classifier as machine learning algorithms, and Long Short-Term Memory, Bidirectional Long Short-Term Memory, and Bidirectional Encoder Representations from Transformers as deep learning algorithms. We split each dataset into training and test data at an 80:20 ratio and additionally apply 5-fold cross-validation to each model. We also optimize the hyperparameters of our models using Grid Search. The ensemble method based on machine learning approaches gives the best performance, 99.9%, on the English Enron dataset, and the hybrid ensemble approach based on a simple average yields the best accuracy, 98.43%, on the Turkish dataset from UCI and Kaggle.
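As a rough illustration of the simple-average hybrid ensemble mentioned above, the following minimal sketch averages the spam probabilities produced by several models and thresholds the result. It is not the authors' implementation: the function name `simple_average_ensemble`, the 0.5 threshold, the three model names, and all probability values are assumptions chosen only for illustration.

```python
from statistics import mean


def simple_average_ensemble(prob_lists, threshold=0.5):
    """Average per-model spam probabilities and threshold the result.

    prob_lists: one list of spam probabilities per model, all the same length.
    Returns (averaged_probabilities, labels), where label 1 = spam and 0 = ham.
    """
    if not prob_lists:
        raise ValueError("need probabilities from at least one model")
    n = len(prob_lists[0])
    if any(len(p) != n for p in prob_lists):
        raise ValueError("all models must score the same emails")
    # zip(*...) groups the scores that different models gave to the same email
    averaged = [mean(scores) for scores in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical spam probabilities for four emails from three models
# (an ML ensemble, a Bi-LSTM, and BERT); the numbers are made up.
ml_ensemble = [0.90, 0.20, 0.60, 0.10]
bilstm = [0.80, 0.40, 0.30, 0.05]
bert = [0.95, 0.30, 0.45, 0.20]

averaged, labels = simple_average_ensemble([ml_ensemble, bilstm, bert])
print(labels)  # [1, 0, 0, 0]
```

In this toy input the third email is flagged as spam by one model alone (0.60) but the average (0.45) falls below the threshold, which shows how averaging can smooth out a single model's borderline decision.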

Keywords

English Datasets, Ensemble Learning, Hybrid Learning, Turkish Datasets, Spam Mail




There are 57 citations in total.

Details

Primary Language	English
Subjects	Information Security Management, Deep Learning, Natural Language Processing
Journal Section	Articles
Authors	Esma Nisa Candan (ORCID 0000-0002-9746-3495), Rehnüma Küçükilhan (ORCID 0009-0009-3930-6502), Alperen Eroğlu (ORCID 0000-0002-1780-7025)
Publication Date	August 31, 2025
Submission Date	September 25, 2024
Acceptance Date	November 14, 2024
Published in Issue	Year 2025 Volume: 7 Issue: 2

Cite

APA	Candan, E. N., Küçükilhan, R., & Eroğlu, A. (2025). Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-based Techniques including Individual, Ensemble and Hybrid Approaches. Necmettin Erbakan Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, 7(2), 189-205.
AMA	Candan EN, Küçükilhan R, Eroğlu A. Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-based Techniques including Individual, Ensemble and Hybrid Approaches. NEJSE. August 2025;7(2):189-205.
Chicago	Candan, Esma Nisa, Rehnüma Küçükilhan, and Alperen Eroğlu. “Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-Based Techniques Including Individual, Ensemble and Hybrid Approaches”. Necmettin Erbakan Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 7, no. 2 (August 2025): 189-205.
EndNote	Candan EN, Küçükilhan R, Eroğlu A (August 1, 2025) Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-based Techniques including Individual, Ensemble and Hybrid Approaches. Necmettin Erbakan Üniversitesi Fen ve Mühendislik Bilimleri Dergisi 7(2), 189–205.
IEEE	E. N. Candan, R. Küçükilhan, and A. Eroğlu, “Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-based Techniques including Individual, Ensemble and Hybrid Approaches”, NEJSE, vol. 7, no. 2, pp. 189–205, 2025.
ISNAD	Candan, Esma Nisa et al. “Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-Based Techniques Including Individual, Ensemble and Hybrid Approaches”. Necmettin Erbakan Üniversitesi Fen ve Mühendislik Bilimleri Dergisi 7/2 (August 2025), 189-205.
JAMA	Candan EN, Küçükilhan R, Eroğlu A. Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-based Techniques including Individual, Ensemble and Hybrid Approaches. NEJSE. 2025;7:189–205.
MLA	Candan, Esma Nisa et al. “Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-Based Techniques Including Individual, Ensemble and Hybrid Approaches”. Necmettin Erbakan Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, vol. 7, no. 2, 2025, pp. 189-205.
Vancouver	Candan EN, Küçükilhan R, Eroğlu A. Spam Mail Detection in Turkish and English Languages: A Holistic Study of AI-based Techniques including Individual, Ensemble and Hybrid Approaches. NEJSE. 2025;7(2):189-205.
