Performance of Using Tag-based Feature Sets in Web Page Classification

Selma Ayşe Özel; Havva Esin Ünal; İlker Ünal

Year 2018, Volume: 22 Issue: 2, 583 - 594, 15.08.2018

Abstract

As the Web is a large collection of data that grows daily, an automatic Web page classification mechanism is needed to reach useful information effectively. The majority of Web pages are HTML documents; therefore, the aim of this study is to explore the effect of HTML tags on the classification process and to determine the most valuable HTML tags for feature extraction in the classification task. To achieve this goal, we employ 13 different datasets and use 5 popular classifiers: SVM, naïve Bayes (NB), kNN, C4.5, and OneR. The statistical analysis shows that the features extracted solely from the anchor, <p>, or <title> tags can be used as an alternative to the features extracted from the whole Web page. SVM is the best among the classifiers used in this study. Using HTML tags for feature extraction improves classification accuracy.
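To make the idea of tag-based feature sets concrete, the following is a minimal sketch of how terms might be collected from selected HTML tags only. It is an illustration under stated assumptions, not the pipeline used in the study: the class name TagTextExtractor, the function tag_features, the sample page, and the simple lowercase tokenization are all hypothetical, and the study's actual preprocessing and classifiers are not reproduced here.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TagTextExtractor(HTMLParser):
    """Collect the text that appears inside a chosen set of HTML tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0      # number of target tags currently open
        self.chunks = []    # text fragments found inside target tags

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def tag_features(html, tags):
    """Return term counts for the text found inside `tags` only."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    # Assumed tokenization: lowercase alphabetic runs, no stemming or stop words.
    tokens = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return Counter(tokens)


# Hypothetical sample page used only for illustration.
page = (
    "<html><head><title>Course Syllabus</title></head><body>"
    "<p>Weekly lecture notes</p>"
    "<a href='hw.html'>Homework page</a>"
    "<div>Footer text</div></body></html>"
)

title_features = tag_features(page, ["title"])
anchor_features = tag_features(page, ["a"])
selected_features = tag_features(page, ["title", "a", "p"])
```

Each resulting term-count mapping could then be turned into a feature vector and passed to a classifier; text outside the selected tags (the <div> in this sample) contributes nothing to the feature set.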

Keywords

Web mining; Classification; HTML tags; Feature extraction

Details

Journal Section	Articles
Authors	Selma Ayşe Özel, Havva Esin Ünal, İlker Ünal
Publication Date	August 15, 2018
Published in Issue	Year 2018 Volume: 22 Issue: 2

Cite

APA	Özel, S. A., Ünal, H. E., & Ünal, İ. (2018). Performance of Using Tag-based Feature Sets in Web Page Classification. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 22(2), 583-594.
AMA	Özel SA, Ünal HE, Ünal İ. Performance of Using Tag-based Feature Sets in Web Page Classification. J. Nat. Appl. Sci. August 2018;22(2):583-594.
Chicago	Özel, Selma Ayşe, Havva Esin Ünal, and İlker Ünal. “Performance of Using Tag-Based Feature Sets in Web Page Classification”. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22, no. 2 (August 2018): 583-94.
EndNote	Özel SA, Ünal HE, Ünal İ (August 1, 2018) Performance of Using Tag-based Feature Sets in Web Page Classification. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22 2 583–594.
IEEE	S. A. Özel, H. E. Ünal, and İ. Ünal, “Performance of Using Tag-based Feature Sets in Web Page Classification”, J. Nat. Appl. Sci., vol. 22, no. 2, pp. 583–594, 2018.
ISNAD	Özel, Selma Ayşe et al. “Performance of Using Tag-Based Feature Sets in Web Page Classification”. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi 22/2 (August 2018), 583-594.
JAMA	Özel SA, Ünal HE, Ünal İ. Performance of Using Tag-based Feature Sets in Web Page Classification. J. Nat. Appl. Sci. 2018;22:583–594.
MLA	Özel, Selma Ayşe et al. “Performance of Using Tag-Based Feature Sets in Web Page Classification”. Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, vol. 22, no. 2, 2018, pp. 583-94.
Vancouver	Özel SA, Ünal HE, Ünal İ. Performance of Using Tag-based Feature Sets in Web Page Classification. J. Nat. Appl. Sci. 2018;22(2):583-94.

e-ISSN: 1308-6529
Linking ISSN (ISSN-L): 1300-7688

All articles published in the journal can be accessed free of charge and are open access under the Creative Commons CC BY-NC (Attribution-NonCommercial) license. All authors and other journal users are deemed to have accepted these terms.