Research Article

Classification of Websites Based on Visual and Textual Data Using a Hybrid Deep Learning Model: DeepCLA-Web

Year 2025, Volume: 7 Issue: 2, 66 - 79, 31.08.2025
https://doi.org/10.46740/alku.1639372

Abstract

This study proposes a hybrid deep learning model that classifies websites using both their textual and their visual content. The volume of information services accessible on the internet grows daily, and within this dense data flow it is important to classify websites accurately by their content. To build a deep learning model that can perform this classification for users, 430 website addresses were selected from the UT1 Blacklist, published by Université Toulouse Capitole, and grouped into three categories: shopping, news, and gaming. The proposed model uses a Long Short-Term Memory (LSTM) network to process the textual content of websites and a Convolutional Neural Network (CNN) to analyze visual data. An Artificial Neural Network (ANN) combines the outputs of the LSTM and CNN models and performs the final classification. The proposed model (DeepCLA-Web) was compared with a CNN model that uses only visual data and an LSTM model that uses only textual data, using metrics common in the literature. The CNN model achieved an accuracy of 59.22% and the LSTM model 75.85%, while the proposed DeepCLA-Web reached 80.89%.
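The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the LSTM and CNN outputs are combined, so concatenation followed by one hidden layer and a softmax output is an assumption, and the feature vectors, layer sizes, random weights, and function names (`fuse_and_classify`, `random_params`) are hypothetical placeholders standing in for trained components.

```python
import math
import random

# The three categories named in the abstract.
CLASSES = ["shopping", "news", "gaming"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def fuse_and_classify(text_features, image_features, params):
    """Late fusion (assumed): concatenate both feature vectors, then classify."""
    fused = list(text_features) + list(image_features)
    hidden = relu(dense(fused, params["w_hidden"], params["b_hidden"]))
    probs = softmax(dense(hidden, params["w_out"], params["b_out"]))
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], probs


def random_params(n_inputs, n_hidden, n_classes, seed=0):
    """Hypothetical untrained weights, used only to make the sketch runnable."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "w_hidden": matrix(n_hidden, n_inputs),
        "b_hidden": [0.0] * n_hidden,
        "w_out": matrix(n_classes, n_hidden),
        "b_out": [0.0] * n_classes,
    }


# Placeholder vectors standing in for the LSTM (text) and CNN (image) outputs.
text_features = [0.2, -0.1, 0.7, 0.4]
image_features = [0.9, 0.3, -0.5]
params = random_params(n_inputs=7, n_hidden=5, n_classes=3)
label, probs = fuse_and_classify(text_features, image_features, params)
print(label, [round(p, 3) for p in probs])
```

Because the weights here are random, the predicted label carries no meaning; the sketch only shows how two single-modality feature vectors can feed one decision layer, which is the structural idea behind comparing the hybrid model with the CNN-only and LSTM-only baselines.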

References

  • [1] M. S. Kurt and E. Yücel, "Web page classification with deep learning methods," Bursa Uludağ University Journal of The Faculty of Engineering, vol. 27, no. 1, pp. 191–202, 2022, doi: 10.17482/uumfd.891038.
  • [2] Y. Yu, "Web page classification algorithm based on deep learning," Computational Intelligence and Neuroscience, vol. 2022, Art. no. 9534918, 2022, doi: 10.1155/2022/9534918.
  • [3] D. López-Sánchez, A. González Arrieta, and J. M. Corchado, "Visual content-based web page categorization with deep transfer learning and metric learning," Neurocomputing, vol. 338, pp. 418–431, 2019, doi: 10.1016/j.neucom.2018.08.086.
  • [4] M. Hashemi, "Web page classification: A survey of perspectives, gaps, and future directions," Multimedia Tools and Applications, vol. 79, pp. 11921–11945, 2020, doi: 10.1007/s11042-019-08373-8.
  • [5] R. Bruni and G. Bianchi, "Web site categorization: A formal approach and robustness analysis in the case of e-commerce detection," Expert Systems with Applications, vol. 142, p. 113001, 2020, doi: 10.1016/j.eswa.2019.113001.
  • [6] D. Cohen, O. Naim, E. Toch, and I. Ben-Gal, "Web site categorization via design attribute learning," Computers & Security, vol. 107, p. 102312, 2021, doi: 10.1016/j.cose.2021.102312.
  • [7] V. K. Bhalla and N. Kumar, "An efficient scheme for automatic web pages categorization using the support vector machine," New Review of Hypermedia and Multimedia, vol. 22, no. 3, pp. 223–242, 2016, doi: 10.1080/13614568.2016.1152316.
  • [8] E. Buber and B. Diri, "Web page classification using RNN," Procedia Computer Science, vol. 154, pp. 62–72, 2019, doi: 10.1016/j.procs.2019.06.011.
  • [9] S. H. Apandi, J. Sallim, and R. Mohamed, "A survey on technique for solving web page classification problem," IOP Conference Series: Materials Science and Engineering, vol. 769, no. 1, p. 012036, 2020, doi: 10.1088/1757-899X/769/1/012036.
  • [10] S. H. Apandi, J. Sallim, R. Mohamed, and N. Ahmad, "Automatic topic-based web page classification using deep learning," International Journal on Informatics Visualization, vol. 7, no. 3-2, pp. 2108–2114, 2023.
  • [11] S. H. Apandi, J. Sallim, R. Mohammed, and A. Madbouly, "Web page classification using convolutional neural network towards eliminating internet addiction," in Proc. 2021 Int. Conf. Software Engineering & Computer Systems and 4th Int. Conf. Computational Science and Information Management (ICSECS-ICOCSIM), 2021, doi: 10.1109/ICSECS52883.2021.00034.
  • [12] P. Prajapati and P. V. Nainwani, "Comparative study of web page classification approaches," International Journal of Computer Applications, vol. 179, no. 45, pp. 6–9, 2018, doi: 10.5120/ijca2018916994.
  • [13] DSI Université Toulouse Capitole, "The blacklists of the University of Toulouse Capitole," Database, [Online]. Available: https://dsi.ut-capitole.fr/blacklists/index_en.php. [Accessed: Nov. 13, 2024].
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012, doi: 10.1145/3065386.
  • [15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
  • [16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997, doi: 10.1162/neco.1997.9.8.1735.
  • [17] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
  • [18] D. M. W. Powers, "Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.
  • [19] H. Bozcu and B. Çubukçu, "Deep learning-based damage assessment in cherry leaves," Journal of Innovative Science and Engineering (JISE), vol. 8, no. 2, pp. 160–178, 2024, doi: 10.38088/jise.1455860.
  • [20] M. Aybar, U. Talaş, and B. Çubukçu, "Transfer öğrenme modelleri ile elma yapraklarında hastalık tespiti," ESTUDAM Bilişim, vol. 5, no. 2, pp. 57–63, 2024, doi: 10.53608/estudambilisim.1556425.

Hibrit Derin Öğrenme Modeli ile Web Sitelerinin Görsel ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web

Abstract

This study proposes a hybrid deep learning model that processes textual and visual content to classify websites. The amount of information services accessible on the internet increases every day, and within this dense data flow it is important to classify websites correctly by their content. To build a deep learning model that can do this for users, 430 web addresses were selected from the UT1 Blacklist published by Université Toulouse and divided into three categories: shopping, news, and gaming. The proposed model uses Long Short-Term Memory (LSTM) to process the textual content of websites and a Convolutional Neural Network (CNN) to analyze image data. An Artificial Neural Network (ANN) that combines the outputs of the LSTM and CNN models performs the final classification. The performance of the proposed website classification model (DeepCLA-Web), which processes visuals with the CNN and text with the LSTM and makes the final decision with the ANN, was compared with a CNN model using only visual data and an LSTM model using only textual data, on metrics commonly used in the literature. The CNN model reached 59.22% accuracy and the LSTM model 75.85%, while the proposed DeepCLA-Web reached 80.89%.

There are 20 citations in total.

Details

Primary Language: Turkish
Subjects: Deep Learning
Journal Section: Articles
Authors:

Harun Şeker (ORCID: 0000-0002-9205-4035)

Burakhan Çubukçu (ORCID: 0000-0003-0480-1254)

Early Pub Date: August 26, 2025
Publication Date: August 31, 2025
Submission Date: February 13, 2025
Acceptance Date: May 4, 2025
Published in Issue: Year 2025, Volume: 7, Issue: 2

Cite

APA Şeker, H., & Çubukçu, B. (2025). Hibrit Derin Öğrenme Modeli ile Web Sitelerinin Görsel ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web. ALKÜ Fen Bilimleri Dergisi, 7(2), 66-79. https://doi.org/10.46740/alku.1639372
AMA Şeker H, Çubukçu B. Hibrit Derin Öğrenme Modeli ile Web Sitelerinin Görsel ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web. ALKÜ Fen Bilimleri Dergisi. August 2025;7(2):66-79. doi:10.46740/alku.1639372
Chicago Şeker, Harun, and Burakhan Çubukçu. “Hibrit Derin Öğrenme Modeli Ile Web Sitelerinin Görsel Ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web”. ALKÜ Fen Bilimleri Dergisi 7, no. 2 (August 2025): 66-79. https://doi.org/10.46740/alku.1639372.
EndNote Şeker H, Çubukçu B (August 1, 2025) Hibrit Derin Öğrenme Modeli ile Web Sitelerinin Görsel ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web. ALKÜ Fen Bilimleri Dergisi 7 2 66–79.
IEEE H. Şeker and B. Çubukçu, “Hibrit Derin Öğrenme Modeli ile Web Sitelerinin Görsel ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web”, ALKÜ Fen Bilimleri Dergisi, vol. 7, no. 2, pp. 66–79, 2025, doi: 10.46740/alku.1639372.
ISNAD Şeker, Harun - Çubukçu, Burakhan. “Hibrit Derin Öğrenme Modeli Ile Web Sitelerinin Görsel Ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web”. ALKÜ Fen Bilimleri Dergisi 7/2 (August 2025), 66-79. https://doi.org/10.46740/alku.1639372.
JAMA Şeker H, Çubukçu B. Hibrit Derin Öğrenme Modeli ile Web Sitelerinin Görsel ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web. ALKÜ Fen Bilimleri Dergisi. 2025;7:66–79.
MLA Şeker, Harun and Burakhan Çubukçu. “Hibrit Derin Öğrenme Modeli Ile Web Sitelerinin Görsel Ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web”. ALKÜ Fen Bilimleri Dergisi, vol. 7, no. 2, 2025, pp. 66-79, doi:10.46740/alku.1639372.
Vancouver Şeker H, Çubukçu B. Hibrit Derin Öğrenme Modeli ile Web Sitelerinin Görsel ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web. ALKÜ Fen Bilimleri Dergisi. 2025;7(2):66-79.