Correction

Correction: Effective Seed URL Selection and Scope Extension Algorithm for Web Crawler

Year 2023, Volume: 35 Issue: 3, 406 - 417, 30.09.2023
The first version of this article was published on 30 March 2023. https://dergipark.org.tr/tr/pub/jeps/issue/76433/1174193

Abstract

The web is a huge, rapidly growing data source that holds all kinds of data. Users rely on search engines to get the data they want from it, and search engines obtain these data through web crawlers. Web crawlers retrieve, parse, and index the data on every page they reach by following the uniform resource locators (URLs) found on web pages. The most important issues in the web crawling process are which URLs to start from and the scope of the crawl. This study presents seed URL selection and scope expansion methods for a general web crawler whose scope is the entire web. For seed URL selection, three different seed URL sets were created for 102 countries, based on the daily time spent per visitor, the number of daily page views per visitor, the percentage of traffic coming from search, and the total number of linking sites, and their performance was analyzed in detail. Furthermore, a new crawling algorithm based on link score was proposed to expand the scope quickly; crawls were run with each seed URL set, compared, and analyzed in detail.
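
The abstract does not give the scoring formulas, so the following Python sketch is only an illustration of the two ideas it names: ranking candidate seed sites by the four traffic metrics, and expanding the crawl from a frontier ordered by link score. The class and function names, the max-normalization, the equal weighting of metrics, and the caller-supplied `link_score` are all assumptions made for this example, not the authors' method.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteStats:
    """Per-site metrics named in the abstract (field names are hypothetical)."""
    url: str
    daily_minutes: float       # daily time spent per visitor
    daily_pageviews: float     # daily page views per visitor
    search_traffic_pct: float  # percentage of traffic coming from search
    linking_sites: int         # total number of linking sites


def select_seeds(sites, k):
    """Return the k URLs with the highest equal-weight composite score.

    Each metric is divided by its maximum over all candidates so the four
    metrics are comparable. Equal weighting is an assumption of this sketch.
    """
    if not sites:
        return []
    metrics = ("daily_minutes", "daily_pageviews", "search_traffic_pct", "linking_sites")
    # Fall back to 1 when a metric is zero everywhere, to avoid dividing by zero.
    maxima = {m: max(getattr(s, m) for s in sites) or 1 for m in metrics}

    def composite(site):
        return sum(getattr(site, m) / maxima[m] for m in metrics)

    ranked = sorted(sites, key=composite, reverse=True)
    return [s.url for s in ranked[:k]]


class Frontier:
    """Crawl frontier that yields the highest-scored URL not pushed before."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, link_score):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum first.
        heapq.heappush(self._heap, (-link_score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In use, the URLs returned by `select_seeds` would be pushed onto a `Frontier`, and each crawled page's outgoing links would be pushed with whatever link score the crawler computes, so that the most promising links are fetched first.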

Project Number

118C127

Correction Note

The Acknowledgements section was inadvertently omitted from the article; this correction adds it.

Supporting Institution

TÜBİTAK

Acknowledgements

This work is part of the project titled "İnternette Heterojen Veri Kaynaklarından Veri Toplanması, Doğrulanması ve Sorgulanması" (Collection, Verification, and Querying of Data from Heterogeneous Data Sources on the Internet), supported by TÜBİTAK under project number 118C127 within the BİDEB-2244 Industrial PhD Program. We thank TÜBİTAK for its support.

Details

Primary Language: Turkish
Subjects: Engineering
Section: Correction Article
Authors:

Zülfü Alanoğlu (ORCID: 0000-0001-9710-5658)

Mehmet Akçayol (ORCID: 0000-0002-6615-1237)

Project Number: 118C127
Publication Date: 30 September 2023
Published in Issue: Year 2023 Volume: 35 Issue: 3

Cite

APA Alanoğlu, Z., & Akçayol, M. (2023). Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması. International Journal of Advances in Engineering and Pure Sciences, 35(3), 406-417.
AMA Alanoğlu Z, Akçayol M. Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması. JEPS. September 2023;35(3):406-417.
Chicago Alanoğlu, Zülfü, and Mehmet Akçayol. “Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması”. International Journal of Advances in Engineering and Pure Sciences 35, no. 3 (September 2023): 406-17.
EndNote Alanoğlu Z, Akçayol M (01 September 2023) Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması. International Journal of Advances in Engineering and Pure Sciences 35 3 406–417.
IEEE Z. Alanoğlu and M. Akçayol, “Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması”, JEPS, vol. 35, no. 3, pp. 406–417, 2023.
ISNAD Alanoğlu, Zülfü - Akçayol, Mehmet. “Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması”. International Journal of Advances in Engineering and Pure Sciences 35/3 (September 2023), 406-417.
JAMA Alanoğlu Z, Akçayol M. Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması. JEPS. 2023;35:406–417.
MLA Alanoğlu, Zülfü and Mehmet Akçayol. “Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması”. International Journal of Advances in Engineering and Pure Sciences, vol. 35, no. 3, 2023, pp. 406-17.
Vancouver Alanoğlu Z, Akçayol M. Web Tarayıcıları için Etkili Tohum URL Seçimi ve Kapsam Genişletme Algoritması. JEPS. 2023;35(3):406-17.