Research Article

Data correlation matrix-based spam URL detection using machine learning algorithms

Number: 056 March 31, 2024

Abstract

In recent years, the widespread availability of internet access has brought both advantages and disadvantages. Users now enjoy numerous benefits, including unlimited access to vast amounts of information and seamless communication with others. However, this accessibility also exposes users to various threats, including malicious software and deceptive practices, and many individuals have been victimized as a result. Common problems include spam emails, fake websites, and phishing attempts. Because internet use is essential in contemporary society, systems that protect users from such malicious activities have become imperative. Accordingly, this study applied eight prominent machine learning algorithms to identify spam URLs in a large dataset. Because the dataset contained only the URLs and their spam labels, additional features such as URL length and the number of digits had to be extracted. These features give the learning algorithms more information on which to base their decisions, resulting in more efficient detection. Since the quality of feature extraction strongly affects the results, the study first extracted features and then trained the models according to the feature weights. This paper proposes a data correlation matrix-based approach for spam URL detection using machine learning algorithms. The distinctive aspect of the study is the feature extraction process, which aims to identify the most impactful features so that the models can be trained with those features weighted accordingly. The entire dataset was used without any reduction. Experimental findings indicate that tree-based machine learning algorithms yield superior results. Among all applied methods, Random Forest achieved the highest success rate, with a detection rate of 96.33% for the non-spam class. Additionally, a combined and weighted calculation method yielded an accuracy of 94.16% for both spam and non-spam data.
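
The feature extraction and correlation step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes only the two features the abstract names (URL length and number of digits), uses Pearson correlation for the matrix, and derives feature weights from the absolute correlation with the spam label. The weighting rule, the function names, and the toy URLs are all assumptions made for illustration.

```python
import math


def extract_features(url: str) -> dict[str, float]:
    """Lexical features named in the abstract: URL length and digit count."""
    return {
        "url_length": float(len(url)),
        "digit_count": float(sum(ch.isdigit() for ch in url)),
    }


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(urls: list[str], labels: list[int]) -> dict[str, dict[str, float]]:
    """Pairwise correlations between every feature and the spam label."""
    rows = [extract_features(u) for u in urls]
    columns = {name: [r[name] for r in rows] for name in rows[0]}
    columns["is_spam"] = [float(label) for label in labels]
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


def feature_weights(matrix: dict[str, dict[str, float]]) -> dict[str, float]:
    """Assumed weighting rule: |correlation with label|, normalized to sum to 1."""
    scores = {
        name: abs(row["is_spam"])
        for name, row in matrix.items()
        if name != "is_spam"
    }
    total = sum(scores.values())
    if total == 0:
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


# Hypothetical toy data (1 = spam, 0 = non-spam); not taken from the study's dataset.
urls = [
    "http://example.com/about",
    "http://example.org/docs/index.html",
    "http://free-prizes-4u.example.net/win?id=9938271&ref=00412",
    "http://example.com/login",
    "http://cheap-meds-24x7.example.net/offer/8827319920",
]
labels = [0, 0, 1, 0, 1]

matrix = correlation_matrix(urls, labels)
weights = feature_weights(matrix)
for name, weight in weights.items():
    print(f"{name}: corr={matrix[name]['is_spam']:+.3f} weight={weight:.3f}")
```

In a fuller pipeline, weights of this kind could be used to rank or scale the features before fitting a tree-based classifier such as Random Forest. How the study actually applies the weights is not specified in the abstract.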

Keywords

Thanks

With respect to all the faculty members who will contribute to the publication process of this article.

Details

Primary Language

English

Subjects

Machine Learning (Other), Artificial Intelligence (Other)

Journal Section

Research Article

Publication Date

March 31, 2024

Submission Date

January 20, 2024

Acceptance Date

February 21, 2024

Published in Issue

Year 2024 Number: 056

APA
Akar, F. (2024). Data correlation matrix-based spam URL detection using machine learning algorithms. Journal of Scientific Reports-A, 056, 56-69. https://doi.org/10.59313/jsr-a.1422913
AMA
1. Akar F. Data correlation matrix-based spam URL detection using machine learning algorithms. JSR-A. 2024;(056):56-69. doi:10.59313/jsr-a.1422913
Chicago
Akar, Funda. 2024. “Data Correlation Matrix-Based Spam URL Detection Using Machine Learning Algorithms”. Journal of Scientific Reports-A, no. 056: 56-69. https://doi.org/10.59313/jsr-a.1422913.
EndNote
Akar F (March 1, 2024) Data correlation matrix-based spam URL detection using machine learning algorithms. Journal of Scientific Reports-A 056 56–69.
IEEE
[1] F. Akar, “Data correlation matrix-based spam URL detection using machine learning algorithms”, JSR-A, no. 056, pp. 56–69, Mar. 2024, doi: 10.59313/jsr-a.1422913.
ISNAD
Akar, Funda. “Data Correlation Matrix-Based Spam URL Detection Using Machine Learning Algorithms”. Journal of Scientific Reports-A. 056 (March 1, 2024): 56-69. https://doi.org/10.59313/jsr-a.1422913.
JAMA
1. Akar F. Data correlation matrix-based spam URL detection using machine learning algorithms. JSR-A. 2024;(056):56–69.
MLA
Akar, Funda. “Data Correlation Matrix-Based Spam URL Detection Using Machine Learning Algorithms”. Journal of Scientific Reports-A, no. 056, Mar. 2024, pp. 56-69, doi:10.59313/jsr-a.1422913.
Vancouver
1. Funda Akar. Data correlation matrix-based spam URL detection using machine learning algorithms. JSR-A. 2024 Mar. 1;(056):56-69. doi:10.59313/jsr-a.1422913