DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması

Abdullah Ammar Karcıoğlu; Hasan Bulut

doi:10.17341/gazimmfd.951157

Araştırma Makalesi

DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması

Yıl 2023, Cilt: 38 Sayı: 2, 875 - 888, 07.10.2022

Abdullah Ammar Karcıoğlu , Hasan Bulut

https://doi.org/10.17341/gazimmfd.951157

Cited By: 3

Öz

Dizi eşleştirme algoritmaları tıp, biyoinformatik, biyoloji gibi birçok alandaki çeşitli uygulamaları nedeniyle bilgisayar bilimindeki önemli çalışma konularından olmuştur. Son yıllarda yeni algoritmalar geliştirilerek metin üzerinde dizi eşleştirme işlemleri hızlandırılmıştır. Dizi eşleştirme algoritmaları tekli ve çoklu olmak üzere iki kısma ayrılır. Çoklu kesin dizi eşleştirme algoritmaları verilen bir T metni içinde d adet P desenlerinin bulunmasını içerir. Bu çalışmada, hash tabanlı çoklu kesin dizi eşleştirme algoritmalarından olan Wu-Manber algoritması ele alınmıştır. Wu-Manber algoritması etkili bir algoritma olmasına rağmen hash çakışmaları gibi bazı kısıtlamalara sahiptir. Çalışmamızda bu eksikliklere yönelik yeni yaklaşım önerilmiştir. Önerilen yaklaşımda, geleneksel Wu-Manber algoritmasının aksine, DNA sekanslarında hash çakışmasını kaldıran hash fonksiyonu kullanarak dizilerdeki arama işlemi q-gram hash karşılaştırması ile gerçekleştirilmiştir. Önerilen yaklaşım literatürde sıkça kullanılan çoklu kesin dizi eşleştirme algoritmalarıyla E. Coli ve Human Chromosome1 veri setinde karşılaştırmalar yapılmıştır. Yapılan deneysel çalışmalar sonucu önerilen yöntemin Wu-Manber algoritmasına kıyasla önerilen yaklaşımda ortalama çalışma zamanı, ortalama karakter ve hash karşılaştırma sayısı gibi performans metrikleri açısından daha iyi sonuçlar elde edilmiştir. Ayrıca, önerilen yaklaşımın Aho Corasick (AC) ve Commentz Walter (CW) gibi iyi bilinen algoritmalardan daha verimli olduğu gösterilmiştir.

Anahtar Kelimeler

Çoklu dizi eşleştirme algoritmaları , desen eşleştirme , sekans analizi , hash fonksiyonu , wu-manber algoritması , Çoklu dizi eşleştirme algoritmaları, desen eşleştirme, sekans analizi, hash fonksiyonu, wu-manber algoritması

Kaynakça

1. Sukhanov S., Wu R., Debes C., Zoubir A. M. Dynamic pattern matching with multiple queries on large scale data streams. Signal Processing, 171, 107402, 2020. Doi: 10.1016/j.sigpro.2019.107402
2. Song S., Gu G., Ryu C., Faro S., Lecroq T., Park K. Fast algorithms for single and multiple pattern Cartesian tree matching. Theoretical Computer Science, 849, 47-63, 2021. Doi: 10.1016/j.tcs.2020.10.009
3. Aldwairi M., Hamzah A. Y., Jarrah M. MultiPLZW: a novel multiple pattern matching search in LZW-compressed data. Computer Communications, 145, 126-136, 2019. Doi: 10.1016/j.comcom.2019.06.011
4. Kumar S., Singh S., Khatoon A., Agarwal S. A Multiple String and Pattern Matching Algorithm Using Context-Free Grammar, In Emerging Trends in Expert Applications and Security, Springer, Singapore, vol. 841, 97-102, 2019. Doi: 10.1007/978-981-13-2285-3_12
5. Singh M., Sharma V. ASCII based Sequential Multiple Pattern Matching Algorithm for High Level Cloning. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 8(6), 271-276, 2017. Link: https://pdfs.semanticscholar.org/df05/c9dda727a6ed18a3b840e1a3f53abbd71ee4.pdf
6. Faro S., Külekci M. O. Towards a Very Fast Multiple String Matching Algorithm for Short Patterns, In Stringology, (pp. 78-91), 2013. Link: Scholar
7. Ho T., Cho S., Oh S., Parallel multiple pattern matching schemes based on Cuckoo filter for deep packet inspection on graphics processing units, IET Inf. Secur, 12(4), 381–388, 2018. Doi: 10.1049/iet-ifs.2017.042
8. Nunes L. S. N., Bordim J. L., Ito Y., Nakano K., A Rabin-Karp Implementation for Handling Multiple Pattern-Matching on the GPU, IEICE Transactions on Information and Systems, 103(12), 2412-2420, 2020. Doi: 10.1587/transinf.2020PAP0002
9. Lai W. S., Wu C. C., Lai L. F., Sie M. C., Two-Phase PFAC Algorithm for Multiple Patterns Matching on CUDA GPUs, Electronics, 8(3), 270, 2019. Doi: 10.3390/electronics8030270
10. Rasool A., Ahmed G. F., Barskar R., Khare N., Efficient multiple pattern matching algorithm based on BMH: MP-BMH, Int. Arab J. Inf. Technol., 16(6), 1121-1130, 2019. Link: https://iajit.org/PDF/November%202019,%20No.%206/15585.pdf
11. Lin C., Li J., Liu C., Chang S., Perfect Hashing Based Parallel Algorithms for Multiple String Matching on Graphic Processing Units, IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, 2639-2650, 2017. doi: 10.1109/TPDS.2017.2674664
12. Karcioglu A. A., Bulut H., Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences, Computers in Biology and Medicine, 131, 104292, 2021. Doi: 10.1016/j.compbiomed.2021.104292
13. Hyyrö H., Juhola M., Vihinen M., On exact string matching of unique oligonucleotides, Computers in Biology and Medicine, 35(2), 173-181, 2005. Doi: 10.1016/j.compbiomed.2004.06.002
14. Xu K., Yang Z., Kang P., Wang Q., Liu W., Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition, Computers in biology and medicine, 108, 122-132, 2019. Doi: 10.1016/j.compbiomed.2019.04.002
15. Kim H., Han Y. S., OMPPM: online multiple palindrome pattern matching, Bioinformatics, 32(8), 1151-1157, 2016. Doi: 10.1093/bioinformatics/btv738
16. Yang X., Han G., Chen J., Cai H., Finding Correlated Patterns via High-Order Matching for Multiple Sourced Biological Data, IEEE Transactions on Biomedical Engineering, vol. 66, no. 4, 1017-1025, 2019, Doi: 10.1109/TBME.2018.2866266
17. Al-Qiari A. K., Al-Issa Y., A fast improved multiple pattern matching algorithm, IEEE 9th International Conference on Information and Communication Systems (ICICS), 55-60,2018. Doi: 10.1109/IACS.2018.8355441
18. Kouzinopoulos C. S., Margaritis K. G., Multiple pattern matching: Survey and experimental results, Neural, Parallel and Scientific Computation, 22, 563-593, 2014. Link: http://www.dynamicpublishers.com/Neural/NPSC2014/NPSC-21-2013-4/08-NPSC-34-SA-04.pdf
19. Aho A.V., Corasick M.J., Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 333–340, 1975. Doi: 10.1145/360825.360855
20. Knuth D.E., Morris J.H, Pratt V.R., Fast pattern matching in strings, SIAM Journal on Computing, 6(2), 323–350, 1977. Doi: 10.1137/0206024
21. Commentz-Walter B., A string matching algorithm fast on the average, Proceedings of the 6th Colloquium on Automata Languages and Programming, 118–132, 1979. Doi:10.1007/3-540-09510-1_10
22. Boyer R. S., Moore J. S., A fast string searching algorithm, Communications of the ACM, 20(10), 762-772, 1977. Doi: 10.1145/359842.359859
23. Navarro G., Raffinot M., Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences, Cambridge university press, 2002.
24. Wu S., Manber U., A fast algorithm for multi-pattern searching, Dept. Comput. Sci., Univ. Arizona, Tucson, AZ, USA, Tech. Rep, TR-94-171994, 1994. Link: https://www.cs.arizona.edu/sites/default/files/TR94-17.pdf
25. Plunkett G., E. Coli veri seti, Escherichia coli str. K-12 substr. DH10B, complete sequence. https://www.ncbi.nlm.nih.gov/nuccore/NC_010473.1.Yayın tarihi 20 Şubat 2008. Erişim tarihi Ocak 11, 2021.
26. Lander E. S., vd. Homo sapiens chromosome, GRCh38.p13 Primary Assembly. https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11 Yayın tarihi 26 Ekim 2006. Erişim tarihi Ocak 11, 2021.

Multiple exact string matching algorithm based on q-gram hash comparison for DNA sequences

Yıl 2023, Cilt: 38 Sayı: 2, 875 - 888, 07.10.2022

Abdullah Ammar Karcıoğlu , Hasan Bulut

https://doi.org/10.17341/gazimmfd.951157

Cited By: 3

Öz

The exact string matching algorithms are among the important study topics in computer science due to their various applications in many fields such as medicine, bioinformatics, and biology. New algorithms have been developed recently, and the string matching on the text has been accelerated. The string matching algorithms are divided into two parts, single and multiple. . The string matching algorithms are divided into two parts, single and multiple. The multiple exact string matching algorithms involve finding d number patterns (P ) in a given text T. In this study, the Wu-Manber algorithm, one of the hash-based multiple exact string matching algorithms, is discussed. Although the Wu-Manber algorithm is effective, it has some limitations, such as hash collisions. In our study, a new approach has is proposed for these limitations. In the proposed approach, unlike the traditional Wu-Manber algorithm, the searching in the sequences is performed by q-gram hash comparison, using the hash function that removes hash collisions in DNA sequences. The proposed approach has been compared with the multiple exact string matching algorithms with the well-known algorithms in the literature on E. Coli and Human Chromosome1 datasets. As a result of the experimental studies, better results have been achieved in terms of performance metrics such as the average runtime, the average number of character and hash comparisons in the proposed approach compared to the Wu-Manber algorithm. Also, the proposed approach is shown to be more efficient than well-known algorithms, such as Aho Corasick (AC) and Commentz Walter (CW).

Anahtar Kelimeler

Multiple string matching , pattern matching , sequence analysis , hash function , wu-manber algorithm , Multiple string matching, pattern matching, sequence analysis, hash function, wu-manber algorithm

Kaynakça

1. Sukhanov S., Wu R., Debes C., Zoubir A. M. Dynamic pattern matching with multiple queries on large scale data streams. Signal Processing, 171, 107402, 2020. Doi: 10.1016/j.sigpro.2019.107402
2. Song S., Gu G., Ryu C., Faro S., Lecroq T., Park K. Fast algorithms for single and multiple pattern Cartesian tree matching. Theoretical Computer Science, 849, 47-63, 2021. Doi: 10.1016/j.tcs.2020.10.009
3. Aldwairi M., Hamzah A. Y., Jarrah M. MultiPLZW: a novel multiple pattern matching search in LZW-compressed data. Computer Communications, 145, 126-136, 2019. Doi: 10.1016/j.comcom.2019.06.011
4. Kumar S., Singh S., Khatoon A., Agarwal S. A Multiple String and Pattern Matching Algorithm Using Context-Free Grammar, In Emerging Trends in Expert Applications and Security, Springer, Singapore, vol. 841, 97-102, 2019. Doi: 10.1007/978-981-13-2285-3_12
5. Singh M., Sharma V. ASCII based Sequential Multiple Pattern Matching Algorithm for High Level Cloning. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 8(6), 271-276, 2017. Link: https://pdfs.semanticscholar.org/df05/c9dda727a6ed18a3b840e1a3f53abbd71ee4.pdf
6. Faro S., Külekci M. O. Towards a Very Fast Multiple String Matching Algorithm for Short Patterns, In Stringology, (pp. 78-91), 2013. Link: Scholar
7. Ho T., Cho S., Oh S., Parallel multiple pattern matching schemes based on Cuckoo filter for deep packet inspection on graphics processing units, IET Inf. Secur, 12(4), 381–388, 2018. Doi: 10.1049/iet-ifs.2017.042
8. Nunes L. S. N., Bordim J. L., Ito Y., Nakano K., A Rabin-Karp Implementation for Handling Multiple Pattern-Matching on the GPU, IEICE Transactions on Information and Systems, 103(12), 2412-2420, 2020. Doi: 10.1587/transinf.2020PAP0002
9. Lai W. S., Wu C. C., Lai L. F., Sie M. C., Two-Phase PFAC Algorithm for Multiple Patterns Matching on CUDA GPUs, Electronics, 8(3), 270, 2019. Doi: 10.3390/electronics8030270
10. Rasool A., Ahmed G. F., Barskar R., Khare N., Efficient multiple pattern matching algorithm based on BMH: MP-BMH, Int. Arab J. Inf. Technol., 16(6), 1121-1130, 2019. Link: https://iajit.org/PDF/November%202019,%20No.%206/15585.pdf
11. Lin C., Li J., Liu C., Chang S., Perfect Hashing Based Parallel Algorithms for Multiple String Matching on Graphic Processing Units, IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, 2639-2650, 2017. doi: 10.1109/TPDS.2017.2674664
12. Karcioglu A. A., Bulut H., Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences, Computers in Biology and Medicine, 131, 104292, 2021. Doi: 10.1016/j.compbiomed.2021.104292
13. Hyyrö H., Juhola M., Vihinen M., On exact string matching of unique oligonucleotides, Computers in Biology and Medicine, 35(2), 173-181, 2005. Doi: 10.1016/j.compbiomed.2004.06.002
14. Xu K., Yang Z., Kang P., Wang Q., Liu W., Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition, Computers in biology and medicine, 108, 122-132, 2019. Doi: 10.1016/j.compbiomed.2019.04.002
15. Kim H., Han Y. S., OMPPM: online multiple palindrome pattern matching, Bioinformatics, 32(8), 1151-1157, 2016. Doi: 10.1093/bioinformatics/btv738
16. Yang X., Han G., Chen J., Cai H., Finding Correlated Patterns via High-Order Matching for Multiple Sourced Biological Data, IEEE Transactions on Biomedical Engineering, vol. 66, no. 4, 1017-1025, 2019, Doi: 10.1109/TBME.2018.2866266
17. Al-Qiari A. K., Al-Issa Y., A fast improved multiple pattern matching algorithm, IEEE 9th International Conference on Information and Communication Systems (ICICS), 55-60,2018. Doi: 10.1109/IACS.2018.8355441
18. Kouzinopoulos C. S., Margaritis K. G., Multiple pattern matching: Survey and experimental results, Neural, Parallel and Scientific Computation, 22, 563-593, 2014. Link: http://www.dynamicpublishers.com/Neural/NPSC2014/NPSC-21-2013-4/08-NPSC-34-SA-04.pdf
19. Aho A.V., Corasick M.J., Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 333–340, 1975. Doi: 10.1145/360825.360855
20. Knuth D.E., Morris J.H, Pratt V.R., Fast pattern matching in strings, SIAM Journal on Computing, 6(2), 323–350, 1977. Doi: 10.1137/0206024
21. Commentz-Walter B., A string matching algorithm fast on the average, Proceedings of the 6th Colloquium on Automata Languages and Programming, 118–132, 1979. Doi:10.1007/3-540-09510-1_10
22. Boyer R. S., Moore J. S., A fast string searching algorithm, Communications of the ACM, 20(10), 762-772, 1977. Doi: 10.1145/359842.359859
23. Navarro G., Raffinot M., Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences, Cambridge university press, 2002.
24. Wu S., Manber U., A fast algorithm for multi-pattern searching, Dept. Comput. Sci., Univ. Arizona, Tucson, AZ, USA, Tech. Rep, TR-94-171994, 1994. Link: https://www.cs.arizona.edu/sites/default/files/TR94-17.pdf
25. Plunkett G., E. Coli veri seti, Escherichia coli str. K-12 substr. DH10B, complete sequence. https://www.ncbi.nlm.nih.gov/nuccore/NC_010473.1.Yayın tarihi 20 Şubat 2008. Erişim tarihi Ocak 11, 2021.
26. Lander E. S., vd. Homo sapiens chromosome, GRCh38.p13 Primary Assembly. https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11 Yayın tarihi 26 Ekim 2006. Erişim tarihi Ocak 11, 2021.

Toplam 26 adet kaynakça vardır.

Ayrıntılar

Birincil Dil	Türkçe
Konular	Mühendislik
Bölüm	Makaleler
Yazarlar	Abdullah Ammar Karcıoğlu 0000-0002-0907-751X Hasan Bulut 0000-0002-4872-5698
Yayımlanma Tarihi	7 Ekim 2022
Gönderilme Tarihi	11 Haziran 2021
Kabul Tarihi	16 Nisan 2022
Yayımlandığı Sayı	Yıl 2023 Cilt: 38 Sayı: 2

Kaynak Göster

APA	Karcıoğlu, A. A., & Bulut, H. (2022). DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, 38(2), 875-888. https://doi.org/10.17341/gazimmfd.951157
AMA	Karcıoğlu AA, Bulut H. DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. GUMMFD. Ekim 2022;38(2):875-888. doi:10.17341/gazimmfd.951157
Chicago	Karcıoğlu, Abdullah Ammar, ve Hasan Bulut. “DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 38, sy. 2 (Ekim 2022): 875-88. https://doi.org/10.17341/gazimmfd.951157.
EndNote	Karcıoğlu AA, Bulut H (01 Ekim 2022) DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 38 2 875–888.
IEEE	A. A. Karcıoğlu ve H. Bulut, “DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması”, GUMMFD, c. 38, sy. 2, ss. 875–888, 2022, doi: 10.17341/gazimmfd.951157.
ISNAD	Karcıoğlu, Abdullah Ammar - Bulut, Hasan. “DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 38/2 (Ekim2022), 875-888. https://doi.org/10.17341/gazimmfd.951157.
JAMA	Karcıoğlu AA, Bulut H. DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. GUMMFD. 2022;38:875–888.
MLA	Karcıoğlu, Abdullah Ammar ve Hasan Bulut. “DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, c. 38, sy. 2, 2022, ss. 875-88, doi:10.17341/gazimmfd.951157.
Vancouver	Karcıoğlu AA, Bulut H. DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. GUMMFD. 2022;38(2):875-88.

Cited By

CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritması

Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi

https://doi.org/10.17341/gazimmfd.1199811

Meta-Sezgisel Tabanlı Clustal-SA Algoritmasını Kullanarak DNA Sekanslarında Çoklu Dizi Hizalama

Journal of the Institute of Science and Technology

https://doi.org/10.21597/jist.1404898

Predicting DNA sequence splice site based on graph convolutional network and DNA graph construction

Journal of King Saud University - Computer and Information Sciences

https://doi.org/10.1016/j.jksuci.2024.102089

Makale Dosyaları

Tam Metin