Performance of Using Tag-based Feature Sets in Web Page Classification

Volume: 22 Number: 2 August 15, 2018

Abstract

Because the Web is a large collection of data that grows daily, an automatic Web page classification mechanism is needed to reach useful information effectively. The majority of Web pages are HTML documents; the aim of this study is therefore to explore the effect of HTML tags on the classification process and to determine the most valuable HTML tags for feature extraction in the classification task. To achieve this goal, we employ 13 different datasets and use 5 popular classifiers: SVM, naïve Bayes (NB), kNN, C4.5, and OneR. The statistical analysis shows that features extracted solely from the anchor (<a>), <p>, or <title> tags can be used as an alternative to features extracted from the whole Web page. SVM is the best among the classifiers used in this study. Using the HTML tags for feature extraction improves classification accuracy.
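To make the idea of tag-based feature sets concrete, the following is a minimal sketch of how term counts might be collected separately for the <title>, <a>, and <p> tags of a page. It is an illustration only: the abstract does not describe the authors' tokenization, term weighting, or tooling, so the names `TagTextExtractor` and `tag_features`, the lowercase alphabetic tokenizer, and the raw term counts are all assumptions made here. It also assumes reasonably well-formed HTML in which every opened tag is closed.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TagTextExtractor(HTMLParser):
    """Collects the text found inside a chosen set of HTML tags (hypothetical helper)."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.open_counts = Counter()  # number of currently open instances per wanted tag
        self.text_by_tag = {tag: [] for tag in self.wanted}

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.open_counts[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.open_counts[tag] > 0:
            self.open_counts[tag] -= 1

    def handle_data(self, data):
        # Text nested in several wanted tags (e.g. <p><a>..</a></p>) is credited to each.
        for tag in self.wanted:
            if self.open_counts[tag] > 0:
                self.text_by_tag[tag].append(data)


def tag_features(html, wanted_tags=("title", "a", "p")):
    """Return one bag-of-words Counter per tag; the tokenizer is an assumption."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    features = {}
    for tag, chunks in parser.text_by_tag.items():
        tokens = re.findall(r"[a-z]+", " ".join(chunks).lower())
        features[tag] = Counter(tokens)
    return features


sample = (
    "<html><head><title>Cheap Flights</title></head><body>"
    "<p>Book cheap flights today.</p>"
    "<a href='/deals'>flight deals</a>"
    "<div>ignored text</div></body></html>"
)
features = tag_features(sample)
print(features["title"])
print(features["a"])
```

Each per-tag counter could then be turned into a feature vector and passed to a classifier of choice, so that a model trained on, say, only the <title> terms can be compared with one trained on the whole page. Text outside the selected tags, such as the <div> content in the sample, is not counted.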

Details

Primary Language

Turkish

Publication Date

August 15, 2018

Submission Date

November 6, 2017

Published in Issue

Year 2018 Volume: 22 Number: 2

APA
Özel, S. A., Ünal, H. E., & Ünal, İ. (2018). Performance of Using Tag-based Feature Sets in Web Page Classification. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 22(2), 583-594. https://izlik.org/JA94JE67GM

e-ISSN: 1308-6529
Linking ISSN (ISSN-L): 1300-7688

All articles published in the journal are open access and can be accessed free of charge under the Creative Commons CC BY-NC (Attribution-NonCommercial) license. All authors and other journal users are deemed to have accepted these terms.