Benchmark Effect of Web Search Engines on Text Mining

Ahmet Toprak; Metin Turan

Abstract

Dictionary creation has been studied for many years using a variety of methods and analyses. With the emergence of the World Wide Web, efforts to build dictionaries from up-to-date data have gained importance. Consequently, the performance of a web search engine directly affects any model that uses web documents for automatic dictionary creation. In this study, web search engines were evaluated in terms of how relevant their suggested documents are to the query. For this purpose, an automatic dictionary creation model that uses web documents was developed. First, topic seed words are determined from the documents initially presented to the system, and the first search is executed with these seed words. The TF-IDF metric is then used to select meaningful words from the first returned document: the n words with the highest TF-IDF values are selected, where the value of n was determined experimentally. These words are added to the dictionary and used to search the web again, upon which the search engine suggests new documents. By repeating this process, experimental dictionaries of a certain size were obtained. Because the documents suggested by each search engine generally differ, the similarity of the dictionaries produced from the top suggested documents can serve as a measure of how well each engine selects relevant documents. Hash similarity was used to evaluate dictionary performance. According to the results, the highest dictionary similarity obtained was 73.9% for the Google search engine, 68.7% for Bing, and 60.5% for Yandex.
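
The two computational steps named in the abstract, selecting the top n words by TF-IDF and comparing dictionaries by hash similarity, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the smoothed IDF formula, the use of a SimHash-style 64-bit fingerprint as the "hash similarity", and all function names (`tokenize`, `top_n_tfidf`, `simhash`, `hash_similarity`) are assumptions made only for illustration.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (no stemming or stop-word removal)."""
    return re.findall(r"[a-z]+", text.lower())


def top_n_tfidf(document, corpus, n):
    """Return the n tokens of `document` with the highest TF-IDF scores.

    `corpus` is a list of reference documents used only for the IDF part
    (an assumption; the abstract does not say which corpus is used).
    """
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    tf = Counter(doc_tokens)
    corpus_sets = [set(tokenize(d)) for d in corpus]
    total_docs = len(corpus_sets) + 1  # +1 counts the document itself
    scores = {}
    for word, count in tf.items():
        df = 1 + sum(1 for s in corpus_sets if word in s)
        idf = math.log(total_docs / df) + 1.0  # smoothed so idf stays positive
        scores[word] = (count / len(doc_tokens)) * idf
    # Sort by score descending, then alphabetically for deterministic ties.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]


def simhash(words, bits=64):
    """Compute a SimHash-style fingerprint (bits <= 64) of a collection of words."""
    weights = [0] * bits
    for word in words:
        # Take the first 8 bytes of MD5 as a stable 64-bit hash of the word.
        h = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hash_similarity(words_a, words_b, bits=64):
    """Fraction of fingerprint bits on which two dictionaries agree (0.0 to 1.0)."""
    distance = bin(simhash(words_a, bits) ^ simhash(words_b, bits)).count("1")
    return 1.0 - distance / bits
```

In an iterative setup like the one the abstract describes, `top_n_tfidf` would be applied to the first document returned for each query, the selected words would be added to the dictionary and used as the next query, and `hash_similarity` would then compare the dictionary built with one search engine against a reference dictionary.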

Details

Primary Language

English

Subjects

Engineering

Section

Research Article

Authors

Ahmet Toprak*
Türkiye

Metin Turan
Türkiye

Publication Date

January 15, 2021

Submission Date

July 21, 2020

Acceptance Date

November 15, 2020

Published in Issue

Year 2021, Volume 4, Issue 1

IZ

https://izlik.org/JA23GJ35ZR

Cite

APA

Toprak, A., & Turan, M. (2021). Benchmark Effect of Web Search Engines on Text Mining. Veri Bilimi, 4(1), 84-92. https://izlik.org/JA23GJ35ZR

AMA

1. Toprak A, Turan M. Benchmark Effect of Web Search Engines on Text Mining. Veri Bilim Derg. 2021;4(1):84-92. https://izlik.org/JA23GJ35ZR

Chicago

Toprak, Ahmet, and Metin Turan. 2021. “Benchmark Effect of Web Search Engines on Text Mining”. Veri Bilimi 4 (1): 84-92. https://izlik.org/JA23GJ35ZR.

EndNote

Toprak A, Turan M (January 1, 2021) Benchmark Effect of Web Search Engines on Text Mining. Veri Bilimi 4 1 84–92.

IEEE

[1] A. Toprak and M. Turan, “Benchmark Effect of Web Search Engines on Text Mining”, Veri Bilim Derg, vol. 4, no. 1, pp. 84–92, Jan. 2021, [Online]. Available: https://izlik.org/JA23GJ35ZR

ISNAD

Toprak, Ahmet - Turan, Metin. “Benchmark Effect of Web Search Engines on Text Mining”. Veri Bilimi 4/1 (January 1, 2021): 84-92. https://izlik.org/JA23GJ35ZR.

JAMA

1. Toprak A, Turan M. Benchmark Effect of Web Search Engines on Text Mining. Veri Bilim Derg. 2021;4:84–92.

MLA

Toprak, Ahmet, and Metin Turan. “Benchmark Effect of Web Search Engines on Text Mining”. Veri Bilimi, vol. 4, no. 1, Jan. 2021, pp. 84-92, https://izlik.org/JA23GJ35ZR.

Vancouver

1. Ahmet Toprak, Metin Turan. Benchmark Effect of Web Search Engines on Text Mining. Veri Bilim Derg [Internet]. January 1, 2021;4(1):84-92. Available from: https://izlik.org/JA23GJ35ZR