WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS

Mehmet Salih Kurt; Eylem Yücel Demirel

doi:10.17482/uumfd.891038

Research Article

Turkish title: Derin Öğrenme Yöntemleri ile Web Sayfası Sınıflandırma

Year 2022, Volume 27, Issue 1, pp. 191-204, 30.04.2022

Cited By: 1

Abstract

Today, millions of websites are widely used to access information on the Internet. Because the number of web pages grows every day, they need to be categorized well to be used effectively. In this study, binary and multi-class classification models that classify web pages with high accuracy were built. The experiments used the URLs and categories of English web pages in the Open Directory Project (ODP). The training dataset was created by fetching the text of each web page from its URL. To our knowledge, this is the first comprehensive web page classification dataset for Turkish. Three deep learning methods that are effective in text classification were used: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). Word embeddings were used for feature extraction instead of the n-gram approaches common in text classification studies. Hyperparameter optimization was performed for the deep learning models, and the binary and multi-class classification models were built with the best parameters. The binary classification models were compared with the results of another study, and the multi-class classification models were compared with each other. The performance of all models was examined in terms of training time and F1 score.
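The abstract reports model performance as F1 scores but does not say how the score is averaged across classes. As a minimal sketch, assuming macro averaging (an assumption, not a detail taken from the paper), the per-class and macro-averaged F1 could be computed as follows. The function names and category labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN) per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for label t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the true label but was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging, assumed)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical category labels, for illustration only (not the paper's data).
y_true = ["Arts", "Arts", "Sports", "Sports", "Science", "Science"]
y_pred = ["Arts", "Sports", "Sports", "Sports", "Science", "Arts"]

scores = f1_per_class(y_true, y_pred)
print(scores)
print(macro_f1(y_true, y_pred))
```

Macro averaging weights every category equally, which matters when categories are imbalanced. A frequency-weighted or micro average would give different numbers on the same predictions.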

Keywords

Web Page Classification, Deep Learning, CNN, LSTM, GRU

Details

Primary Language	English
Subjects	Artificial Intelligence
Section	Research Articles
Authors	Mehmet Salih Kurt (ORCID 0000-0002-0345-4599), Eylem Yücel Demirel (ORCID 0000-0003-1979-8860)
Publication Date	30 April 2022
Submission Date	31 March 2021
Acceptance Date	13 February 2022
Published in Issue	Year 2022, Volume 27, Issue 1

Cite

APA	Kurt, M. S., & Yücel Demirel, E. (2022). WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS. Uludağ Üniversitesi Mühendislik Fakültesi Dergisi, 27(1), 191-204. https://doi.org/10.17482/uumfd.891038
AMA	Kurt MS, Yücel Demirel E. WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS. UUJFE. April 2022;27(1):191-204. doi:10.17482/uumfd.891038
Chicago	Kurt, Mehmet Salih, and Eylem Yücel Demirel. “WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS”. Uludağ Üniversitesi Mühendislik Fakültesi Dergisi 27, no. 1 (April 2022): 191-204. https://doi.org/10.17482/uumfd.891038.
EndNote	Kurt MS, Yücel Demirel E (01 April 2022) WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS. Uludağ Üniversitesi Mühendislik Fakültesi Dergisi 27 1 191–204.
IEEE	M. S. Kurt and E. Yücel Demirel, “WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS”, UUJFE, vol. 27, no. 1, pp. 191–204, 2022, doi: 10.17482/uumfd.891038.
ISNAD	Kurt, Mehmet Salih - Yücel Demirel, Eylem. “WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS”. Uludağ Üniversitesi Mühendislik Fakültesi Dergisi 27/1 (April 2022), 191-204. https://doi.org/10.17482/uumfd.891038.
JAMA	Kurt MS, Yücel Demirel E. WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS. UUJFE. 2022;27:191–204.
MLA	Kurt, Mehmet Salih and Eylem Yücel Demirel. “WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS”. Uludağ Üniversitesi Mühendislik Fakültesi Dergisi, vol. 27, no. 1, 2022, pp. 191-204, doi:10.17482/uumfd.891038.
Vancouver	Kurt MS, Yücel Demirel E. WEB PAGE CLASSIFICATION WITH DEEP LEARNING METHODS. UUJFE. 2022;27(1):191-204.

Cited By

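Hibrit Derin Öğrenme Modeli ile Web Sitelerinin Görsel ve Metinsel Verilere Dayalı Sınıflandırılması: DeepCLA-Web [Classification of Websites Based on Visual and Textual Data with a Hybrid Deep Learning Model: DeepCLA-Web]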

ALKÜ Fen Bilimleri Dergisi

https://doi.org/10.46740/alku.1639372

Bursa Uludağ University, Faculty of Engineering, Dean's Office, Görükle Campus, Nilüfer, 16059 Bursa. Tel: (224) 294 1907, Fax: (224) 294 1903, e-mail: mmfd@uludag.edu.tr