Advanced Phishing Detection: Leveraging t-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset

Taha Etem; Mustafa Teke

doi:10.26650/acin.1521835

Research Article

Acta Infologica, Year 2024, Volume: 8, Issue: 2, pp. 213-221, 31.12.2024


Abstract

Phishing attacks remain a major challenge in today’s digital world, and detection techniques must keep pace with constantly changing tactics. In this paper, we propose a method for identifying phishing attempts using the extensive PhiUSIIL dataset, which comprises 134,850 legitimate URLs and 100,945 phishing URLs and provides a robust foundation for analysis. We applied the t-SNE technique for feature extraction, condensing the original 51 features into only 2 while preserving high detection accuracy. We evaluated several machine learning algorithms, namely Logistic Regression, Naive Bayes, k-Nearest Neighbors (kNN), Decision Trees, and Random Forest, on both the full and the reduced datasets. The Decision Tree algorithm performed best on the original dataset, achieving 99.7% accuracy. Notably, kNN achieved 99.2% accuracy on the feature-extracted data. We also observed significant improvements in Logistic Regression and Random Forest performance when using the feature-extracted dataset. The proposed method offers substantial benefits in computational efficiency: the feature-extracted dataset requires less processing power, which makes it well suited to systems with limited resources. These findings pave the way for more powerful and flexible phishing detection systems that can identify and neutralize emerging threats in real time.
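
The pipeline summarized in the abstract (reduce 51 features to 2 with t-SNE, then compare a classifier on the full and reduced data) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn is available, uses a small synthetic table from `make_classification` as a stand-in for the PhiUSIIL features, and picks `perplexity=30`, `n_neighbors=5`, and a 25% hold-out split purely for illustration. The helper `knn_accuracy` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_accuracy(features, labels, seed=0):
    """Hold out 25% of the rows and return the kNN accuracy on them."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
    return model.score(x_test, y_test)


# Synthetic stand-in (assumption): 600 rows, 51 numeric columns, binary label.
X, y = make_classification(
    n_samples=600, n_features=51, n_informative=10, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# scikit-learn's TSNE has no transform() for unseen rows, so this sketch
# embeds the whole table at once and splits it afterwards.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
X_embedded = tsne.fit_transform(X_scaled)

acc_full = knn_accuracy(X_scaled, y)
acc_reduced = knn_accuracy(X_embedded, y)
print(f"kNN accuracy on 51 features:     {acc_full:.3f}")
print(f"kNN accuracy on 2 t-SNE features: {acc_reduced:.3f}")
```

Because the embedding is computed before the split, the held-out rows influence the coordinates of the training rows. The accuracies printed here describe only the synthetic data and say nothing about the figures reported in the paper.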

Keywords

Machine learning, cybersecurity, feature extraction, data mining

There are 27 citations in total.

Details

Primary Language	English
Subjects	Machine Learning (Other)
Journal Section	Research Article
Authors	Taha Etem (ORCID: 0000-0003-1419-5008), Mustafa Teke (ORCID: 0000-0002-7262-4918)
Submission Date	July 24, 2024
Acceptance Date	December 11, 2024
Publication Date	December 31, 2024
Published in Issue	Year 2024 Volume: 8 Issue: 2

Cite

APA	Etem, T., & Teke, M. (2024). Advanced Phishing Detection: Leveraging t-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset. Acta Infologica, 8(2), 213-221. https://doi.org/10.26650/acin.1521835
AMA	Etem T, Teke M. Advanced Phishing Detection: Leveraging t-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset. ACIN. December 2024;8(2):213-221. doi:10.26650/acin.1521835
Chicago	Etem, Taha, and Mustafa Teke. “Advanced Phishing Detection: Leveraging T-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset”. Acta Infologica 8, no. 2 (December 2024): 213-21. https://doi.org/10.26650/acin.1521835.
EndNote	Etem T, Teke M (December 1, 2024) Advanced Phishing Detection: Leveraging t-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset. Acta Infologica 8 2 213–221.
IEEE	T. Etem and M. Teke, “Advanced Phishing Detection: Leveraging t-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset”, ACIN, vol. 8, no. 2, pp. 213–221, 2024, doi: 10.26650/acin.1521835.
ISNAD	Etem, Taha - Teke, Mustafa. “Advanced Phishing Detection: Leveraging T-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset”. Acta Infologica 8/2 (December 2024), 213-221. https://doi.org/10.26650/acin.1521835.
JAMA	Etem T, Teke M. Advanced Phishing Detection: Leveraging t-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset. ACIN. 2024;8:213–221.
MLA	Etem, Taha and Mustafa Teke. “Advanced Phishing Detection: Leveraging T-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset”. Acta Infologica, vol. 8, no. 2, 2024, pp. 213-21, doi:10.26650/acin.1521835.
Vancouver	Etem T, Teke M. Advanced Phishing Detection: Leveraging t-SNE Feature Extraction and Machine Learning on a Comprehensive URL Dataset. ACIN. 2024;8(2):213-21.
