Year 2019, Volume 7, Issue 3, Pages 608-618, 2019-09-15

BÜYÜK VERİDE METİN BENZERLİK ALGORİTMALARININ VERİ EŞLEME PERFORMANSLARININ KARŞILAŞTIRILMASI
COMPARISON OF THE DATA MATCHING PERFORMANCES OF STRING SIMILARITY ALGORITHMS IN BIG DATA

Bekir AKSOY [1] , Sinan UĞUZ [2] , Okan ORAL [3]


The sharp growth in world tourism in recent years has made the sector a natural field of study for big data. This study proposes a solution, based on big data techniques and string similarity algorithms (SSA), to the problems that arise when hotel records from different providers are entered into databases under different names and addresses. As a sample, 2599 London hotels belonging to a tourism agency with a wide hotel network were selected. To match these hotels against approximately three million hotel records from seventy different providers, a Map-Reduce process built on the Soundex algorithm was used, which substantially reduced both the number of matching operations and the processing time. In the second stage of the study, the Dice coefficient, Levenshtein, and longest common subsequence (LCS) algorithms were compared in terms of the number of records they matched correctly and their processing time. Before the algorithms were applied, the words that lowered their scores were identified in the database and removed. The Dice coefficient algorithm performed better in terms of correct matching, and the Levenshtein algorithm performed better in terms of processing time.
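
The Soundex-based Map-Reduce step described in the abstract can be illustrated with a minimal, single-process sketch. It is not the authors' implementation: the choice of keying each record by the Soundex code of its full name, the function names (`soundex`, `map_phase`, `shuffle`, `reduce_phase`, `candidate_pairs`), and the sample hotel names are all assumptions made for illustration. The idea is that only records sharing a key are compared, which is how the number of comparisons drops.

```python
from collections import defaultdict

# Standard American Soundex letter groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(text):
    """Return the 4-character Soundex code of the ASCII letters in text."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def map_phase(names, source):
    """Emit (soundex key, (source, name)) for every record."""
    for name in names:
        yield soundex(name), (source, name)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets


def reduce_phase(buckets):
    """Within each bucket, pair reference hotels with provider records."""
    for values in buckets.values():
        refs = [n for s, n in values if s == "ref"]
        provs = [n for s, n in values if s == "provider"]
        for r in refs:
            for p in provs:
                yield r, p


def candidate_pairs(reference, provider):
    mapped = list(map_phase(reference, "ref")) + list(map_phase(provider, "provider"))
    return list(reduce_phase(shuffle(mapped)))


# Hypothetical sample data, not taken from the study.
reference = ["Hilton London Paddington", "Park Plaza Westminster"]
provider = ["Hilton Paddington", "Park Plaza Westminster Bridge", "Savoy Hotel"]
pairs = candidate_pairs(reference, provider)
```

With these sample names, the full cross-product would need six comparisons, while the bucketed version produces two candidate pairs. In a real cluster the shuffle is done by the framework rather than by an in-memory dictionary.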

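The three similarity measures compared in the abstract, together with the removal of score-lowering words, can be sketched as follows. This is an illustrative sketch, not the code used in the study: the character-bigram form of the Dice coefficient, the function names, and the `NOISE_WORDS` set are assumptions, since the abstract does not list which words were removed.

```python
from collections import Counter

# Hypothetical noise words; the abstract does not say which words were removed.
NOISE_WORDS = {"hotel", "the", "london"}


def strip_noise(name):
    """Lower-case a name and drop words assumed to lower similarity scores."""
    return " ".join(w for w in name.lower().split() if w not in NOISE_WORDS)


def bigrams(s):
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_coefficient(a, b):
    """Dice coefficient over character bigrams: 2 * |X & Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:  # both strings are too short to form a bigram
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((Counter(x) & Counter(y)).values())
    return 2 * overlap / (len(x) + len(y))


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, `strip_noise("The Savoy Hotel London")` leaves only `"savoy"`, so the shared but uninformative words no longer affect the comparison. Levenshtein and LCS return raw counts here; turning them into scores comparable with the Dice coefficient needs a normalisation step, which the abstract does not specify.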
Primary Language en
Subjects Computer Science, Information Systems
Section Research Article
Authors

Orcid: 0000-0001-8052-9411
Author: Bekir AKSOY
Institution: Isparta University of Applied Sciences
Country: Turkey


Orcid: 0000-0003-4397-6196
Author: Sinan UĞUZ
Institution: Isparta University of Applied Sciences
Country: Turkey


Orcid: 0000-0003-4256-0930
Author: Okan ORAL (Corresponding Author)
Institution: Akdeniz University
Country: Turkey


Dates

Publication Date: 15 September 2019

APA AKSOY, B., UĞUZ, S., ORAL, O. (2019). COMPARISON OF THE DATA MATCHING PERFORMANCES OF STRING SIMILARITY ALGORITHMS IN BIG DATA. Mühendislik Bilimleri ve Tasarım Dergisi, 7(3), 608-618. DOI: 10.21923/jesd.467036