Research Article

Metin Madenciliği, Makine ve Derin Öğrenme Algoritmaları ile Web Sayfalarının Sınıflandırılması

Year 2019, Volume: 5 Issue: 2, 29 - 43, 31.12.2019

Abstract

Web sitelerin sayısı hızlı bir şekilde artmakta
ve bu sitelerde bulunabilecek zararlı içeriği engellemek ya da yararlı
bilgilere daha kolay ulaşmak için, Web sayfalarını içerikleri doğrultusunda
sınıflandırmak bir çözüm olarak ortaya çıkmaktadır. Sınıflandırma sayesinde,
belirli sitelerin erişimine izin verilebilir veya bunları engellemek için Web
siteleri filtrelenebilir. Bu çalışmada, farklı makine öğrenmesi yöntemleri ve
yapay sinir ağları kullanılarak Web sitesi sınıflandırma problemi
incelenmiştir. Bu sınıflandırma probleminin çözümü için, İkili Sınıflandırma ve
Çoklu Sınıflandırma olarak iki farklı yaklaşım uygulanmış, her iki yaklaşım da
çalışma kapsamında toplanan Web siteleri üzerinde test edilip, başarımları
karşılaştırılmıştır. Tüm deneysel sonuçlar göz önüne alındığında İkili
Sınıflandırma yaklaşımının, sadece istenilen bir Web site sınıfının
filtrelenmesi görevini yerine getirmek için kullanıldığında daha etkili olacağı
tespit edilmiştir. Başarıma bakıldığında ikili sınıflandırıcılar için en iyi
performans gösteren algoritma Lojistik Regresyondur. Çoklu Sınıflandırma
yaklaşımında uygulanan algoritmaları arasından ise en yüksek başarıma sahip
yöntem Destek Vektör Makineleri (SVM) olmuştur. Ayrıca, Çoklu Sınıflandırma
problemi için farklı kelime vektörleştirme yöntemleri denenmiş ve
performansları karşılaştırılmıştır. İkili ve Çoklu sınıflandırma
yaklaşımlarında kullanılan algoritmalarının ayrı ayrı ve farklı vektörleştirme
yöntemleri ile denenmesi, Web sayfalarının sınıflandırılması ve içerik filtrelenmesi
problemlerini birlikte ele alınmasını sağlamış olup, alandaki benzer
çalışmalardan farkı ortaya konmuştur.

References

  • Chen, Y., Cheng, B., & Cheng, X. (2016). Food safety document classification using LSTM-based ensemble learning. Revista Técnica de la Facultad de Ingeniería Universidad del Zulia, 39(10), 172-178.
  • Chen, R. C., & Hsieh, C. H. (2006). Web page classification based on a support vector machine using a weighted vote schema. Expert Systems with Applications, 31(2), 427-435.
  • Gali, N., Mariescu-Istodor, R., & Fränti, P. (2017). Using linguistic features to automatically extract web page title. Expert Systems with Applications, 79, 296-312.
  • Hartmann, J., Huppertz, J., Schamp, C., & Heitmann, M. (2019). Comparing automated text classification methods. International Journal of Research in Marketing, 36(1), 20-38.
  • Hilbe, J. M. (2011). Logistic regression. International encyclopedia of statistical science, 755-758.
  • Internet Live Stats (2019). “Total Number of Websites”, https://www.internetlivestats.com/total-number-of-websites/ (accessed: 16.05.2019)
  • Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., & Brown, D. (2019). Text classification algorithms: A survey. Information, 10(4).
  • Li, Y. H., & Jain, A. K. (1998). Classification of text documents. The Computer Journal, 41(8), 537-546.
  • Loper, E., & Bird, S. (2002). NLTK: the natural language toolkit. arXiv preprint cs/0205028.
  • Manning, C., Raghavan, P., & Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100-103.
  • Netcraft (2019). “July 2019 Web Server Survey”, https://news.netcraft.com/archives/category/web-server-survey/ (accessed: 16.05.2019)
  • Onan, A., Korukoğlu, S., & Bulut, H. (2016). Ensemble of keyword extraction methods and classifiers in text classification. Expert Systems with Applications, 57, 232-247.
  • Panigrahi, R., & Borah, S. (2019). Classification and Analysis of Facebook Metrics Dataset Using Supervised Classifiers. In S. Borah, N. Dey, R. Babo & A. S. Ashour (Eds.), Social Network Analytics, Elsevier.
  • Rekik, R., Kallel, I., Casillas, J., & Alimi, A. M. (2018). Assessing web sites quality: A systematic literature review by text and association rules mining. International Journal of Information Management, 38(1), 201-216.
  • Ren, X. Y., Shi, C., Zhang, D., & Wang, W. S. (2019). An improved SVM web page classification algorithm. In Journal of Physics: Conference Series (Vol. 1187, No. 4, p. 042063). IOP Publishing.
  • Shen, D., Yang, Q., & Chen, Z. (2007). Noise reduction through summarization for Web-page classification. Information Processing & Management, 43(6), 1735-1747.
  • Sinoara, R. A., Camacho-Collados, J., Rossi, R. G., Navigli, R., & Rezende, S. O. (2019). Knowledge-enhanced document embeddings for text classification. Knowledge-Based Systems, 163, 955-971.
  • Stein, R. A., Jaques, P. A., & Valiati, J. F. (2019). An analysis of hierarchical text classification using word embeddings. Information Sciences, 471, 216-232.
  • Takenouchi, T., & Ishii, S. (2018). Binary classifiers ensemble based on Bregman divergence for multi-class classification. Neurocomputing, 273, 424-434.
  • Xu, S., Li, Y., & Wang, Z. (2017). Bayesian multinomial Naïve Bayes classifier to text classification. In Advanced multimedia and ubiquitous engineering (pp. 347-352). Springer, Singapore.

Web Page Categorization with Text Mining, Machine and Deep Learning Algorithms

Abstract

As the number of Web sites grows rapidly, classifying Web pages by their content presents itself as a way to block harmful content that may be found on these sites or to reach useful information more easily. With such a classification, access to specific sites can be allowed, or sites can be filtered so that access to them is blocked. This study examines the Web site classification problem using different machine learning methods and artificial neural networks. Two approaches are applied to the problem, Binary Classification and Multi-class Classification; both are tested on a set of Web sites collected for this study, and their performance is compared. Considering all experimental results, the Binary Classification approach is found to be more effective when the task is to filter only one desired class of Web sites. In terms of performance, Logistic Regression is the best-performing algorithm among the binary classifiers, while Support Vector Machines (SVM) is the most successful method among the algorithms applied in the Multi-class Classification approach. In addition, different word vectorization methods are tried for the Multi-class Classification problem and their performance is compared. The algorithms used in the Binary and Multi-class Classification approaches are each tested separately with different vectorization methods. In this way, Web page classification and content filtering are addressed together, which distinguishes this study from similar work in the field.
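
The difference between the two framings in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy corpus, the category names, the plain term-count vectorization, and the helper names (`train_logreg`, `is_blocked`, `categorize`) are all assumptions made for illustration, and the multi-class decision here reuses one-vs-rest logistic regression rather than the SVM reported in the study.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (page text, category). Not data from the study.
PAGES = [
    ("cheap flights hotel booking travel deals", "travel"),
    ("hotel reviews beach travel guide", "travel"),
    ("football match score league goal", "sports"),
    ("league table football transfer news", "sports"),
    ("casino poker bet jackpot bonus", "gambling"),
    ("bet online casino slots bonus", "gambling"),
]

def build_vocab(texts):
    """Map every distinct token to a column index."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocab):
    """Bag-of-words term counts; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(text.split()).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=500, lr=0.1):
    """Binary logistic regression fitted with plain batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def score(model, x):
    """Predicted probability that x belongs to the model's positive class."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

texts = [text for text, _ in PAGES]
labels = [category for _, category in PAGES]
vocab = build_vocab(texts)
X = [vectorize(text, vocab) for text in texts]

# Binary framing: a single classifier answers "is this page in the blocked class?"
BLOCKED = "gambling"  # assumed target class for the filter
binary_model = train_logreg(X, [1.0 if c == BLOCKED else 0.0 for c in labels])

def is_blocked(text):
    return score(binary_model, vectorize(text, vocab)) >= 0.5

# Multi-class framing: one model per category, the highest score wins.
ovr_models = {
    category: train_logreg(X, [1.0 if c == category else 0.0 for c in labels])
    for category in sorted(set(labels))
}

def categorize(text):
    x = vectorize(text, vocab)
    return max(ovr_models, key=lambda category: score(ovr_models[category], x))
```

In this sketch, `is_blocked("casino bonus bet")` only has to separate one class from everything else, which mirrors the filtering task for which the abstract reports the binary approach as more effective, whereas `categorize("hotel travel booking")` must choose among all categories at once.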

There are 20 citations in total.

Details

Primary Language Turkish
Journal Section Articles
Authors

Oumout Chouseinoglou 0000-0002-8513-351X

İlker Şahin 0000-0002-7416-1013

Publication Date December 31, 2019
Published in Issue Year 2019 Volume: 5 Issue: 2

Cite

APA Chouseinoglou, O., & Şahin, İ. (2019). Metin Madenciliği, Makine ve Derin Öğrenme Algoritmaları ile Web Sayfalarının Sınıflandırılması. Yönetim Bilişim Sistemleri Dergisi, 5(2), 29-43.