Research Article
BibTex RIS Cite

DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması

Year 2023, Volume: 38 Issue: 2, 875 - 888, 07.10.2022
https://doi.org/10.17341/gazimmfd.951157

Abstract

Dizi eşleştirme algoritmaları tıp, biyoinformatik, biyoloji gibi birçok alandaki çeşitli uygulamaları nedeniyle bilgisayar bilimindeki önemli çalışma konularından olmuştur. Son yıllarda yeni algoritmalar geliştirilerek metin üzerinde dizi eşleştirme işlemleri hızlandırılmıştır. Dizi eşleştirme algoritmaları tekli ve çoklu olmak üzere iki kısma ayrılır. Çoklu kesin dizi eşleştirme algoritmaları verilen bir T metni içinde d adet P desenlerinin bulunmasını içerir. Bu çalışmada, hash tabanlı çoklu kesin dizi eşleştirme algoritmalarından olan Wu-Manber algoritması ele alınmıştır. Wu-Manber algoritması etkili bir algoritma olmasına rağmen hash çakışmaları gibi bazı kısıtlamalara sahiptir. Çalışmamızda bu eksikliklere yönelik yeni yaklaşım önerilmiştir. Önerilen yaklaşımda, geleneksel Wu-Manber algoritmasının aksine, DNA sekanslarında hash çakışmasını kaldıran hash fonksiyonu kullanarak dizilerdeki arama işlemi q-gram hash karşılaştırması ile gerçekleştirilmiştir. Önerilen yaklaşım literatürde sıkça kullanılan çoklu kesin dizi eşleştirme algoritmalarıyla E. Coli ve Human Chromosome1 veri setinde karşılaştırmalar yapılmıştır. Yapılan deneysel çalışmalar sonucu önerilen yöntemin Wu-Manber algoritmasına kıyasla önerilen yaklaşımda ortalama çalışma zamanı, ortalama karakter ve hash karşılaştırma sayısı gibi performans metrikleri açısından daha iyi sonuçlar elde edilmiştir. Ayrıca, önerilen yaklaşımın Aho Corasick (AC) ve Commentz Walter (CW) gibi iyi bilinen algoritmalardan daha verimli olduğu gösterilmiştir.

References

  • 1. Sukhanov S., Wu R., Debes C., Zoubir A. M. Dynamic pattern matching with multiple queries on large scale data streams. Signal Processing, 171, 107402, 2020. Doi: 10.1016/j.sigpro.2019.107402
  • 2. Song S., Gu G., Ryu C., Faro S., Lecroq T., Park K. Fast algorithms for single and multiple pattern Cartesian tree matching. Theoretical Computer Science, 849, 47-63, 2021. Doi: 10.1016/j.tcs.2020.10.009
  • 3. Aldwairi M., Hamzah A. Y., Jarrah M. MultiPLZW: a novel multiple pattern matching search in LZW-compressed data. Computer Communications, 145, 126-136, 2019. Doi: 10.1016/j.comcom.2019.06.011
  • 4. Kumar S., Singh S., Khatoon A., Agarwal S. A Multiple String and Pattern Matching Algorithm Using Context-Free Grammar, In Emerging Trends in Expert Applications and Security, Springer, Singapore, vol. 841, 97-102, 2019. Doi: 10.1007/978-981-13-2285-3_12
  • 5. Singh M., Sharma V. ASCII based Sequential Multiple Pattern Matching Algorithm for High Level Cloning. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 8(6), 271-276, 2017. Link: https://pdfs.semanticscholar.org/df05/c9dda727a6ed18a3b840e1a3f53abbd71ee4.pdf
  • 6. Faro S., Külekci M. O. Towards a Very Fast Multiple String Matching Algorithm for Short Patterns, In Stringology, (pp. 78-91), 2013. Link: Scholar
  • 7. Ho T., Cho S., Oh S., Parallel multiple pattern matching schemes based on Cuckoo filter for deep packet inspection on graphics processing units, IET Inf. Secur, 12(4), 381–388, 2018. Doi: 10.1049/iet-ifs.2017.042
  • 8. Nunes L. S. N., Bordim J. L., Ito Y., Nakano K., A Rabin-Karp Implementation for Handling Multiple Pattern-Matching on the GPU, IEICE Transactions on Information and Systems, 103(12), 2412-2420, 2020. Doi: 10.1587/transinf.2020PAP0002
  • 9. Lai W. S., Wu C. C., Lai L. F., Sie M. C., Two-Phase PFAC Algorithm for Multiple Patterns Matching on CUDA GPUs, Electronics, 8(3), 270, 2019. Doi: 10.3390/electronics8030270
  • 10. Rasool A., Ahmed G. F., Barskar R., Khare N., Efficient multiple pattern matching algorithm based on BMH: MP-BMH, Int. Arab J. Inf. Technol., 16(6), 1121-1130, 2019. Link: https://iajit.org/PDF/November%202019,%20No.%206/15585.pdf
  • 11. Lin C., Li J., Liu C., Chang S., Perfect Hashing Based Parallel Algorithms for Multiple String Matching on Graphic Processing Units, IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, 2639-2650, 2017. doi: 10.1109/TPDS.2017.2674664
  • 12. Karcioglu A. A., Bulut H., Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences, Computers in Biology and Medicine, 131, 104292, 2021. Doi: 10.1016/j.compbiomed.2021.104292
  • 13. Hyyrö H., Juhola M., Vihinen M., On exact string matching of unique oligonucleotides, Computers in Biology and Medicine, 35(2), 173-181, 2005. Doi: 10.1016/j.compbiomed.2004.06.002
  • 14. Xu K., Yang Z., Kang P., Wang Q., Liu W., Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition, Computers in biology and medicine, 108, 122-132, 2019. Doi: 10.1016/j.compbiomed.2019.04.002
  • 15. Kim H., Han Y. S., OMPPM: online multiple palindrome pattern matching, Bioinformatics, 32(8), 1151-1157, 2016. Doi: 10.1093/bioinformatics/btv738
  • 16. Yang X., Han G., Chen J., Cai H., Finding Correlated Patterns via High-Order Matching for Multiple Sourced Biological Data, IEEE Transactions on Biomedical Engineering, vol. 66, no. 4, 1017-1025, 2019, Doi: 10.1109/TBME.2018.2866266
  • 17. Al-Qiari A. K., Al-Issa Y., A fast improved multiple pattern matching algorithm, IEEE 9th International Conference on Information and Communication Systems (ICICS), 55-60,2018. Doi: 10.1109/IACS.2018.8355441
  • 18. Kouzinopoulos C. S., Margaritis K. G., Multiple pattern matching: Survey and experimental results, Neural, Parallel and Scientific Computation, 22, 563-593, 2014. Link: http://www.dynamicpublishers.com/Neural/NPSC2014/NPSC-21-2013-4/08-NPSC-34-SA-04.pdf
  • 19. Aho A.V., Corasick M.J., Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 333–340, 1975. Doi: 10.1145/360825.360855
  • 20. Knuth D.E., Morris J.H, Pratt V.R., Fast pattern matching in strings, SIAM Journal on Computing, 6(2), 323–350, 1977. Doi: 10.1137/0206024
  • 21. Commentz-Walter B., A string matching algorithm fast on the average, Proceedings of the 6th Colloquium on Automata Languages and Programming, 118–132, 1979. Doi:10.1007/3-540-09510-1_10
  • 22. Boyer R. S., Moore J. S., A fast string searching algorithm, Communications of the ACM, 20(10), 762-772, 1977. Doi: 10.1145/359842.359859
  • 23. Navarro G., Raffinot M., Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences, Cambridge university press, 2002.
  • 24. Wu S., Manber U., A fast algorithm for multi-pattern searching, Dept. Comput. Sci., Univ. Arizona, Tucson, AZ, USA, Tech. Rep, TR-94-171994, 1994. Link: https://www.cs.arizona.edu/sites/default/files/TR94-17.pdf
  • 25. Plunkett G., E. Coli veri seti, Escherichia coli str. K-12 substr. DH10B, complete sequence. https://www.ncbi.nlm.nih.gov/nuccore/NC_010473.1.Yayın tarihi 20 Şubat 2008. Erişim tarihi Ocak 11, 2021.
  • 26. Lander E. S., vd. Homo sapiens chromosome, GRCh38.p13 Primary Assembly. https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11 Yayın tarihi 26 Ekim 2006. Erişim tarihi Ocak 11, 2021.

Multiple exact string matching algorithm based on q-gram hash comparison for DNA sequences

Year 2023, Volume: 38 Issue: 2, 875 - 888, 07.10.2022
https://doi.org/10.17341/gazimmfd.951157

Abstract

The exact string matching algorithms are among the important study topics in computer science due to their various applications in many fields such as medicine, bioinformatics, and biology. New algorithms have been developed recently, and the string matching on the text has been accelerated. The string matching algorithms are divided into two parts, single and multiple. . The string matching algorithms are divided into two parts, single and multiple. The multiple exact string matching algorithms involve finding d number patterns (P ) in a given text T. In this study, the Wu-Manber algorithm, one of the hash-based multiple exact string matching algorithms, is discussed. Although the Wu-Manber algorithm is effective, it has some limitations, such as hash collisions. In our study, a new approach has is proposed for these limitations. In the proposed approach, unlike the traditional Wu-Manber algorithm, the searching in the sequences is performed by q-gram hash comparison, using the hash function that removes hash collisions in DNA sequences. The proposed approach has been compared with the multiple exact string matching algorithms with the well-known algorithms in the literature on E. Coli and Human Chromosome1 datasets. As a result of the experimental studies, better results have been achieved in terms of performance metrics such as the average runtime, the average number of character and hash comparisons in the proposed approach compared to the Wu-Manber algorithm. Also, the proposed approach is shown to be more efficient than well-known algorithms, such as Aho Corasick (AC) and Commentz Walter (CW).

References

  • 1. Sukhanov S., Wu R., Debes C., Zoubir A. M. Dynamic pattern matching with multiple queries on large scale data streams. Signal Processing, 171, 107402, 2020. Doi: 10.1016/j.sigpro.2019.107402
  • 2. Song S., Gu G., Ryu C., Faro S., Lecroq T., Park K. Fast algorithms for single and multiple pattern Cartesian tree matching. Theoretical Computer Science, 849, 47-63, 2021. Doi: 10.1016/j.tcs.2020.10.009
  • 3. Aldwairi M., Hamzah A. Y., Jarrah M. MultiPLZW: a novel multiple pattern matching search in LZW-compressed data. Computer Communications, 145, 126-136, 2019. Doi: 10.1016/j.comcom.2019.06.011
  • 4. Kumar S., Singh S., Khatoon A., Agarwal S. A Multiple String and Pattern Matching Algorithm Using Context-Free Grammar, In Emerging Trends in Expert Applications and Security, Springer, Singapore, vol. 841, 97-102, 2019. Doi: 10.1007/978-981-13-2285-3_12
  • 5. Singh M., Sharma V. ASCII based Sequential Multiple Pattern Matching Algorithm for High Level Cloning. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 8(6), 271-276, 2017. Link: https://pdfs.semanticscholar.org/df05/c9dda727a6ed18a3b840e1a3f53abbd71ee4.pdf
  • 6. Faro S., Külekci M. O. Towards a Very Fast Multiple String Matching Algorithm for Short Patterns, In Stringology, (pp. 78-91), 2013. Link: Scholar
  • 7. Ho T., Cho S., Oh S., Parallel multiple pattern matching schemes based on Cuckoo filter for deep packet inspection on graphics processing units, IET Inf. Secur, 12(4), 381–388, 2018. Doi: 10.1049/iet-ifs.2017.042
  • 8. Nunes L. S. N., Bordim J. L., Ito Y., Nakano K., A Rabin-Karp Implementation for Handling Multiple Pattern-Matching on the GPU, IEICE Transactions on Information and Systems, 103(12), 2412-2420, 2020. Doi: 10.1587/transinf.2020PAP0002
  • 9. Lai W. S., Wu C. C., Lai L. F., Sie M. C., Two-Phase PFAC Algorithm for Multiple Patterns Matching on CUDA GPUs, Electronics, 8(3), 270, 2019. Doi: 10.3390/electronics8030270
  • 10. Rasool A., Ahmed G. F., Barskar R., Khare N., Efficient multiple pattern matching algorithm based on BMH: MP-BMH, Int. Arab J. Inf. Technol., 16(6), 1121-1130, 2019. Link: https://iajit.org/PDF/November%202019,%20No.%206/15585.pdf
  • 11. Lin C., Li J., Liu C., Chang S., Perfect Hashing Based Parallel Algorithms for Multiple String Matching on Graphic Processing Units, IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, 2639-2650, 2017. doi: 10.1109/TPDS.2017.2674664
  • 12. Karcioglu A. A., Bulut H., Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences, Computers in Biology and Medicine, 131, 104292, 2021. Doi: 10.1016/j.compbiomed.2021.104292
  • 13. Hyyrö H., Juhola M., Vihinen M., On exact string matching of unique oligonucleotides, Computers in Biology and Medicine, 35(2), 173-181, 2005. Doi: 10.1016/j.compbiomed.2004.06.002
  • 14. Xu K., Yang Z., Kang P., Wang Q., Liu W., Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition, Computers in biology and medicine, 108, 122-132, 2019. Doi: 10.1016/j.compbiomed.2019.04.002
  • 15. Kim H., Han Y. S., OMPPM: online multiple palindrome pattern matching, Bioinformatics, 32(8), 1151-1157, 2016. Doi: 10.1093/bioinformatics/btv738
  • 16. Yang X., Han G., Chen J., Cai H., Finding Correlated Patterns via High-Order Matching for Multiple Sourced Biological Data, IEEE Transactions on Biomedical Engineering, vol. 66, no. 4, 1017-1025, 2019, Doi: 10.1109/TBME.2018.2866266
  • 17. Al-Qiari A. K., Al-Issa Y., A fast improved multiple pattern matching algorithm, IEEE 9th International Conference on Information and Communication Systems (ICICS), 55-60,2018. Doi: 10.1109/IACS.2018.8355441
  • 18. Kouzinopoulos C. S., Margaritis K. G., Multiple pattern matching: Survey and experimental results, Neural, Parallel and Scientific Computation, 22, 563-593, 2014. Link: http://www.dynamicpublishers.com/Neural/NPSC2014/NPSC-21-2013-4/08-NPSC-34-SA-04.pdf
  • 19. Aho A.V., Corasick M.J., Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 333–340, 1975. Doi: 10.1145/360825.360855
  • 20. Knuth D.E., Morris J.H, Pratt V.R., Fast pattern matching in strings, SIAM Journal on Computing, 6(2), 323–350, 1977. Doi: 10.1137/0206024
  • 21. Commentz-Walter B., A string matching algorithm fast on the average, Proceedings of the 6th Colloquium on Automata Languages and Programming, 118–132, 1979. Doi:10.1007/3-540-09510-1_10
  • 22. Boyer R. S., Moore J. S., A fast string searching algorithm, Communications of the ACM, 20(10), 762-772, 1977. Doi: 10.1145/359842.359859
  • 23. Navarro G., Raffinot M., Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences, Cambridge university press, 2002.
  • 24. Wu S., Manber U., A fast algorithm for multi-pattern searching, Dept. Comput. Sci., Univ. Arizona, Tucson, AZ, USA, Tech. Rep, TR-94-171994, 1994. Link: https://www.cs.arizona.edu/sites/default/files/TR94-17.pdf
  • 25. Plunkett G., E. Coli veri seti, Escherichia coli str. K-12 substr. DH10B, complete sequence. https://www.ncbi.nlm.nih.gov/nuccore/NC_010473.1.Yayın tarihi 20 Şubat 2008. Erişim tarihi Ocak 11, 2021.
  • 26. Lander E. S., vd. Homo sapiens chromosome, GRCh38.p13 Primary Assembly. https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11 Yayın tarihi 26 Ekim 2006. Erişim tarihi Ocak 11, 2021.
There are 26 citations in total.

Details

Primary Language Turkish
Subjects Engineering
Journal Section Makaleler
Authors

Abdullah Ammar Karcıoğlu 0000-0002-0907-751X

Hasan Bulut 0000-0002-4872-5698

Publication Date October 7, 2022
Submission Date June 11, 2021
Acceptance Date April 16, 2022
Published in Issue Year 2023 Volume: 38 Issue: 2

Cite

APA Karcıoğlu, A. A., & Bulut, H. (2022). DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, 38(2), 875-888. https://doi.org/10.17341/gazimmfd.951157
AMA Karcıoğlu AA, Bulut H. DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. GUMMFD. October 2022;38(2):875-888. doi:10.17341/gazimmfd.951157
Chicago Karcıoğlu, Abdullah Ammar, and Hasan Bulut. “DNA Sekansları için Q-Gram Hash karşılaştırmasına Dayalı çoklu Kesin Dizi eşleştirme Algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 38, no. 2 (October 2022): 875-88. https://doi.org/10.17341/gazimmfd.951157.
EndNote Karcıoğlu AA, Bulut H (October 1, 2022) DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 38 2 875–888.
IEEE A. A. Karcıoğlu and H. Bulut, “DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması”, GUMMFD, vol. 38, no. 2, pp. 875–888, 2022, doi: 10.17341/gazimmfd.951157.
ISNAD Karcıoğlu, Abdullah Ammar - Bulut, Hasan. “DNA Sekansları için Q-Gram Hash karşılaştırmasına Dayalı çoklu Kesin Dizi eşleştirme Algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 38/2 (October 2022), 875-888. https://doi.org/10.17341/gazimmfd.951157.
JAMA Karcıoğlu AA, Bulut H. DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. GUMMFD. 2022;38:875–888.
MLA Karcıoğlu, Abdullah Ammar and Hasan Bulut. “DNA Sekansları için Q-Gram Hash karşılaştırmasına Dayalı çoklu Kesin Dizi eşleştirme Algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, vol. 38, no. 2, 2022, pp. 875-88, doi:10.17341/gazimmfd.951157.
Vancouver Karcıoğlu AA, Bulut H. DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. GUMMFD. 2022;38(2):875-88.