Two-stage malicious URL detection architecture: A next-generation threat recognition and classification approach
Year 2025, Volume: 14, Issue: 4, 1498-1508, 15.10.2025
Durmuş Özkan Şahin, Sercan Demirci, Muhammet Abdullah Şahin, Nuri Can Acar, Hamit Burak Can Kodal
Abstract
The increasing prevalence of online threats has made detecting and classifying malicious URLs a critical research topic. This study evaluates and compares machine learning algorithms, combined with feature selection methods, for detecting malicious URLs and determining their attack types. Three datasets were used. The first, ISCX-2016, contains 79 distinct features; the same 79 features were extracted for each URL in the other two datasets, which consist of raw URLs. Modeling proceeded in two stages: the first stage focused on accurately classifying malicious URLs, and the second compared how effectively the algorithms determined attack types. Random Forest, trained on all features without feature selection, achieved the highest performance in both binary classification (97% accuracy) and multi-class classification (98% accuracy). These findings can guide the development of malicious URL detection systems and contribute to the literature.
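The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `lexical_features` computes six simple lexical values rather than the 79 ISCX-URL-2016 features, `TwoStageDetector` is a hypothetical wrapper, and the two stub classes only stand in for trained models (for example, fitted Random Forest classifiers exposing a scikit-learn style `predict`). The label strings are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence
from urllib.parse import urlsplit


class Classifier(Protocol):
    """Any object with a scikit-learn style predict() method."""

    def predict(self, X: Sequence[Sequence[float]]) -> Sequence: ...


def lexical_features(url: str) -> list[float]:
    """Return a few illustrative lexical features (NOT the 79 used in the study)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return [
        float(len(url)),                         # total URL length
        float(len(host)),                        # host length
        float(host.count(".")),                  # number of dots in the host
        float(sum(ch.isdigit() for ch in url)),  # digit count in the whole URL
        float(len(parts.path)),                  # path length
        float(len(parts.query)),                 # query-string length
    ]


@dataclass
class TwoStageDetector:
    detector: Classifier         # stage 1: 0 = benign, 1 = malicious
    type_classifier: Classifier  # stage 2: attack type, malicious URLs only

    def classify(self, urls: Sequence[str]) -> list[str]:
        X = [lexical_features(u) for u in urls]
        flags = list(self.detector.predict(X))
        labels = ["benign"] * len(urls)
        # Only URLs flagged in stage 1 are passed on to stage 2.
        malicious_idx = [i for i, flag in enumerate(flags) if flag == 1]
        if malicious_idx:
            types = self.type_classifier.predict([X[i] for i in malicious_idx])
            for i, attack_type in zip(malicious_idx, types):
                labels[i] = str(attack_type)
        return labels


class LongUrlStub:
    """Stand-in for a trained binary model: flags URLs longer than 40 characters."""

    def predict(self, X):
        return [1 if row[0] > 40 else 0 for row in X]


class ConstantTypeStub:
    """Stand-in for a trained multi-class model: always answers 'phishing'."""

    def predict(self, X):
        return ["phishing"] * len(X)


pipeline = TwoStageDetector(LongUrlStub(), ConstantTypeStub())
results = pipeline.classify([
    "http://example.com/",
    "http://login.example.com.evil.test/verify/account?id=123",
])
print(results)
```

Keeping the two stages as separate models means the attack-type classifier only ever sees URLs already judged malicious, which mirrors the binary and multi-class evaluations reported separately in the abstract.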