Araştırma Makalesi
BibTex RIS Kaynak Göster

CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritması

Yıl 2024, , 1933 - 1944, 20.05.2024
https://doi.org/10.17341/gazimmfd.1199811

Öz

Üretilen verilerin saklamasında alandan tasarruf etmek önemlidir. Sıkıştırma algoritmaları bu tasarrufu sağlamak amacıyla kullanılmaktadır. Saklanmak istenen veri bir kere sıkıştırılmakta fakat üzerinde arama yapmak amacıyla defalarca erişilmektedir. Bu sebeple sıkıştırılmış verinin en büyük dezavantajı bu verinin kullanılmak istendiğinde açılması gerekliliğidir. Hızlı bir açma algoritması ile veya açma işlemine ihtiyaç duymayan bir sıkıştırılmış arama yönteminin kullanılması ile bu dezavantajlı durum giderebilir. Sıkıştırılmış arama hem arama uzayının küçük olması hem de açma yapmaması sayesinde aç-ve-ara yöntemlere göre daha hızlı sonuçlar elde edebilmektedir. Bu makalede sıkıştırılmış arama desteği sunan paralel yarı statik kelime tabanlı bir sıkıştırma algoritması olan CComp sunulmuştur. CComp’un amacı diğer paralel sıkıştırma algoritmalarının hızında sıkıştırma-açma ve daha hızlı sıkıştırılmış arama yapmaktır. CComp sıkıştırma, açma ve arama işlemlerini paralel olarak gerçekleştirmektedir. CComp diğer paralel yöntemler ile karşılaştırılmıştır. Sonuçlarda gösterildiği gibi CComp’un sıkıştırma oranları diğer kelime tabanlı algoritmalarla paralel sonuçlar vermektedir. Sıkıştırılmış arama işleminde ise daha önce en iyi sonucu veren Zstd algoritmasına göre yaklaşık 7 kat daha hızlı arama sonuçları elde edilmiştir. Bu sonuçlar ile CComp sıkıştırılmış arama desteği sunan algoritmalara daha iyi bir alternatif olarak gösterilebilmektedir.

Destekleyen Kurum

-

Proje Numarası

-

Teşekkür

-

Kaynakça

  • Özköse H, Arı ES, Gencer C., Yesterday, Today and Tomorrow of Big Data. Procedia - Social and Behavioral Sciences, 195, 1042–1050, 2015.
  • Lawnik M, Pelka A, Kapczyński A., A New Way to Store Simple Text Files. Algorithms, 13, 101 2020.
  • Gupta A, Nigam S., A Review on Different Types of Lossless Data Compression Techniques. 2021.
  • Suneetha D, Kishore DR, Babu PN., A Compression Algorithm for DNA Palindrome Compression Technique, ITM Web of Conferences, Mumbai-Hindistan, 1-5, 27-28 Haziran 2020.
  • Rădescu R., Concordance Techniques in Lossless Data Compression of Text Files, 2021 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE), Bükreş-Romanya, 1–4, 23-25 Mart 2021.
  • Abliz W, Wu H, Maimaiti M, Wushouer J, Abiderexiti K, Yibulayin T, Wumaier A., A Syllable-Based Technique for Uyghur Text Compression. Information, 11, 172, 2020.
  • Pandey M, Shrivastava S, Pandey S, Shridevi S., An Enhanced Data Compression Algorithm, 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), Vellore-Hindistan, 1–4, 24-25 Şubat 2020.
  • Murugesan G., Codon Based Compression Algorithm for DNA Sequences with Two Bit Encoding. European Journal of Molecular & Clinical Medicine, 7, 33-41, 2020.
  • Silva M, Pratas D, Pinho AJ., Efficient DNA sequence compression with neural networks. GigaScience, 9(11), 1-15, 2020.
  • Ghuge S., Map and Trie based Compression Algorithm for Data Transmission, 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore-Hindistan, 137–141, 24-25 Şubat 2020.
  • Hilal TA, Hilal HA., Turkish Text Compression via Characters Encoding. Procedia Computer Science, 175, 286–91 2020.
  • Nguyen VH, Nguyen HT, Duong HN, Snasel V., n-Gram-based text compression. Computational intelligence and neuroscience, 2016, 1-12, 2016.
  • Demchenko Y, De Laat C, Membrey P., Defining architecture components of the Big Data Ecosystem, 2014 International conference on collaboration technologies and systems (CTS), Minneapolis-Minnesota-USA, 104–112, 19-23 Mayıs 2014.
  • Rattanaopas K, Kaewkeeree S., Improving Hadoop MapReduce performance with data compression: A study using wordcount job, 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Phuket-Tayland, 564–567, 27-30 Haziran 2017.
  • Bartík M, Ubik S, Kubalik P., LZ4 compression algorithm on FPGA, IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Kahire-Mısır, 179–182, 6 Aralık 2015.
  • Guerra A, Lotero J, Isaza S., Performance comparison of sequential and parallel compression applications for DNA raw data. The Journal of Supercomputing, 72, 4696–717, 2016.
  • Sun Y, Gong X, Yang Y., Data compression and parallel computation model research under big data environment, International Conference on Computer Communication and Informatics (ICCCI), Lefkoşa-Kıbrıs, 1–4, 27-29 Eylül 2017.
  • Kumar VS, Nanjundiah R, Thazhuthaveetil MJ, Govindarajan R., Impact of message compression on the scalability of an atmospheric modeling application on clusters. Parallel Computing, 34, 1–16, 2008.
  • Ahmad I, He Y, Liou ML., Video compression with parallel processing. Parallel Computing, 28, 1039–78, 2002.
  • Adler M., pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory, 2015.
  • Gilchrist J., Parallel data compression with bzip2, Proceedings of the 16th IASTED international conference on parallel and distributed computing and systems, Dallas-USA, 559–564, 14-16 Aralık 2004.
  • Bell T, Adjeroh D, Mukherjee A., Pattern matching in compressed texts and images. 2001.
  • Mishra SP, Singh CG, Prasad R., A review on compressed pattern matching. Perspectives in Science, 8, 727–9, 2016.
  • Karcıoğlu AA, Bulut H., DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, 38, 875–88, 2022.
  • BULUŞ HN, Carus A, Mesut A., A new word-based compression model allowing compressed pattern matching. Turkish Journal of Electrical Engineering & Computer Sciences, 25, 3607–22, 2017.
  • Öztürk E, Mesut A, Diri B., Multi-stream word-based compression algorithm for compressed text search. Arabian Journal for Science and Engineering, 43, 8209–21, 2018.
  • Srivastav S, Singh PK, Yadav D., A Method to Improve Exact Matching Results in Compressed Text using Parallel Wavelet Tree. Scalable Computing: Practice and Experience, 22, 387–400, 2021.
  • Russo LMS, Navarro G, Oliveira AL, Morales P., Approximate String Matching with Compressed Indexes. Algorithms, 2, 1105–36, 2009.
  • Melink S, Raghavan S, Yang B, Garcia-Molina H., Building a distributed full-text index for the web. ACM Transactions on Information Systems (TOIS), 19, 217–41, 2001.
  • Bast H, Buchhold B., An index for efficient semantic full-text search, Proceedings of the 22nd ACM international conference on Information & Knowledge Management, California-USA, 369–78, 27 Ekim - 1 Kasım 2013.
  • Deutsch P, others., GZIP file format specification version 4.3. 1996.
  • Deutsch P., Rfc1951: Deflate compressed data format specification version 1.3, RFC Editor, 1996.
  • Oswal S, Singh A, Kumari K., Deflate compression algorithm. International Journal of Engineering Research and General Science, 4, 430–6, 2016.
  • Aşşık MM, Oral M., Kanonik Huffman kod sözcükleri uzunluklarının evrim stratejileri algoritması ile belirlenmesi. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, 38, 771–80, 2022.
  • Deutsch P, Gailly J-L., Zlib compressed data format specification version 3.3, RFC 1950, Mayıs, 1996.
  • Burrows M, Wheeler D., A block-sorting lossless data compression algorithm, Digital SRC Research Report, 1994.
  • Manzini G., An analysis of the Burrows—Wheeler transform. Journal of the ACM (JACM), 48, 407–30, 2001.
  • Collet Y, Kucherawy M., Zstandard Compression and the application/zstd Media Type. RFC 8478, 2018.
  • Duda J, Tahboub K, Gadgil NJ, Delp EJ., The use of asymmetric numeral systems as an accurate replacement for Huffman coding, 2015 Picture Coding Symposium (PCS), Cairns-Avustralya, 65–69, 31 Mayıs - 3 Haziran 2015.
  • Belkov R, Kirilenko I., Compressing Embedded GNU/Linux and Windows 10 IoT Images Using XZ Utilities, 1st Scientific and Practical Conference “Software Engineering and Information Organization”, SEIM-2016, St Petersburg-Rusya, 41, 22 Nisan 2016.
  • Kirby G., Zipf’s law. UK Journal of Naval Science, 10, 180–5, 1985.
  • Ferragina P, Navarro G., Pizza&Chili Corpus—Compressed indexes and their testbeds. September, 2005.
  • Mahoney M., Large text compression benchmark, http://www.mattmahoney.net/dc/text.html, Yayın tarihi Eylül 15, 2022. Erişim tarihi Mayıs 19, 2022.

CComp: A parallel compression algorithm for compressed word search

Yıl 2024, , 1933 - 1944, 20.05.2024
https://doi.org/10.17341/gazimmfd.1199811

Öz

It is important to save space storing the generated data. To achieve this, compression algorithms are used. Stored data is compressed once but accessed many times to search on it. For this reason, the biggest disadvantage of compressed data is that it needs to be decompressed when it will be used. This disadvantage can be eliminated by using a fast decompression algorithm or a compressed search method that does not require decompression. Compressed search can achieve faster results than open-and-search methods, thanks to its small search space and not using decompression. In this article, CComp, a parallel semi-static word-based compression algorithm that supports compressed search, is presented. The purpose of CComp is to obtain faster search results while compressing-decompressing at the speed of other parallel compression algorithms. CComp performs these operations in parallel. CComp has been compared to other parallel methods. As shown in the results, the compression ratios of CComp give results in parallel with other word-based algorithms. In the compressed search process, results were obtained approximately 7 times faster than the Zstd algorithm, which gave the best results before. With these results, CComp can be shown as a better alternative to algorithms that support compressed search.

Proje Numarası

-

Kaynakça

  • Özköse H, Arı ES, Gencer C., Yesterday, Today and Tomorrow of Big Data. Procedia - Social and Behavioral Sciences, 195, 1042–1050, 2015.
  • Lawnik M, Pelka A, Kapczyński A., A New Way to Store Simple Text Files. Algorithms, 13, 101 2020.
  • Gupta A, Nigam S., A Review on Different Types of Lossless Data Compression Techniques. 2021.
  • Suneetha D, Kishore DR, Babu PN., A Compression Algorithm for DNA Palindrome Compression Technique, ITM Web of Conferences, Mumbai-Hindistan, 1-5, 27-28 Haziran 2020.
  • Rădescu R., Concordance Techniques in Lossless Data Compression of Text Files, 2021 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE), Bükreş-Romanya, 1–4, 23-25 Mart 2021.
  • Abliz W, Wu H, Maimaiti M, Wushouer J, Abiderexiti K, Yibulayin T, Wumaier A., A Syllable-Based Technique for Uyghur Text Compression. Information, 11, 172, 2020.
  • Pandey M, Shrivastava S, Pandey S, Shridevi S., An Enhanced Data Compression Algorithm, 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), Vellore-Hindistan, 1–4, 24-25 Şubat 2020.
  • Murugesan G., Codon Based Compression Algorithm for DNA Sequences with Two Bit Encoding. European Journal of Molecular & Clinical Medicine, 7, 33-41, 2020.
  • Silva M, Pratas D, Pinho AJ., Efficient DNA sequence compression with neural networks. GigaScience, 9(11), 1-15, 2020.
  • Ghuge S., Map and Trie based Compression Algorithm for Data Transmission, 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore-Hindistan, 137–141, 24-25 Şubat 2020.
  • Hilal TA, Hilal HA., Turkish Text Compression via Characters Encoding. Procedia Computer Science, 175, 286–91 2020.
  • Nguyen VH, Nguyen HT, Duong HN, Snasel V., n-Gram-based text compression. Computational intelligence and neuroscience, 2016, 1-12, 2016.
  • Demchenko Y, De Laat C, Membrey P., Defining architecture components of the Big Data Ecosystem, 2014 International conference on collaboration technologies and systems (CTS), Minneapolis-Minnesota-USA, 104–112, 19-23 Mayıs 2014.
  • Rattanaopas K, Kaewkeeree S., Improving Hadoop MapReduce performance with data compression: A study using wordcount job, 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Phuket-Tayland, 564–567, 27-30 Haziran 2017.
  • Bartík M, Ubik S, Kubalik P., LZ4 compression algorithm on FPGA, IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Kahire-Mısır, 179–182, 6 Aralık 2015.
  • Guerra A, Lotero J, Isaza S., Performance comparison of sequential and parallel compression applications for DNA raw data. The Journal of Supercomputing, 72, 4696–717, 2016.
  • Sun Y, Gong X, Yang Y., Data compression and parallel computation model research under big data environment, International Conference on Computer Communication and Informatics (ICCCI), Lefkoşa-Kıbrıs, 1–4, 27-29 Eylül 2017.
  • Kumar VS, Nanjundiah R, Thazhuthaveetil MJ, Govindarajan R., Impact of message compression on the scalability of an atmospheric modeling application on clusters. Parallel Computing, 34, 1–16, 2008.
  • Ahmad I, He Y, Liou ML., Video compression with parallel processing. Parallel Computing, 28, 1039–78, 2002.
  • Adler M., pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory, 2015.
  • Gilchrist J., Parallel data compression with bzip2, Proceedings of the 16th IASTED international conference on parallel and distributed computing and systems, Dallas-USA, 559–564, 14-16 Aralık 2004.
  • Bell T, Adjeroh D, Mukherjee A., Pattern matching in compressed texts and images. 2001.
  • Mishra SP, Singh CG, Prasad R., A review on compressed pattern matching. Perspectives in Science, 8, 727–9, 2016.
  • Karcıoğlu AA, Bulut H., DNA sekansları için q-gram hash karşılaştırmasına dayalı çoklu kesin dizi eşleştirme algoritması. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, 38, 875–88, 2022.
  • BULUŞ HN, Carus A, Mesut A., A new word-based compression model allowing compressed pattern matching. Turkish Journal of Electrical Engineering & Computer Sciences, 25, 3607–22, 2017.
  • Öztürk E, Mesut A, Diri B., Multi-stream word-based compression algorithm for compressed text search. Arabian Journal for Science and Engineering, 43, 8209–21, 2018.
  • Srivastav S, Singh PK, Yadav D., A Method to Improve Exact Matching Results in Compressed Text using Parallel Wavelet Tree. Scalable Computing: Practice and Experience, 22, 387–400, 2021.
  • Russo LMS, Navarro G, Oliveira AL, Morales P., Approximate String Matching with Compressed Indexes. Algorithms, 2, 1105–36, 2009.
  • Melink S, Raghavan S, Yang B, Garcia-Molina H., Building a distributed full-text index for the web. ACM Transactions on Information Systems (TOIS), 19, 217–41, 2001.
  • Bast H, Buchhold B., An index for efficient semantic full-text search, Proceedings of the 22nd ACM international conference on Information & Knowledge Management, California-USA, 369–78, 27 Ekim - 1 Kasım 2013.
  • Deutsch P, others., GZIP file format specification version 4.3. 1996.
  • Deutsch P., Rfc1951: Deflate compressed data format specification version 1.3, RFC Editor, 1996.
  • Oswal S, Singh A, Kumari K., Deflate compression algorithm. International Journal of Engineering Research and General Science, 4, 430–6, 2016.
  • Aşşık MM, Oral M., Kanonik Huffman kod sözcükleri uzunluklarının evrim stratejileri algoritması ile belirlenmesi. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, 38, 771–80, 2022.
  • Deutsch P, Gailly J-L., Zlib compressed data format specification version 3.3, RFC 1950, Mayıs, 1996.
  • Burrows M, Wheeler D., A block-sorting lossless data compression algorithm, Digital SRC Research Report, 1994.
  • Manzini G., An analysis of the Burrows—Wheeler transform. Journal of the ACM (JACM), 48, 407–30, 2001.
  • Collet Y, Kucherawy M., Zstandard Compression and the application/zstd Media Type. RFC 8478, 2018.
  • Duda J, Tahboub K, Gadgil NJ, Delp EJ., The use of asymmetric numeral systems as an accurate replacement for Huffman coding, 2015 Picture Coding Symposium (PCS), Cairns-Avustralya, 65–69, 31 Mayıs - 3 Haziran 2015.
  • Belkov R, Kirilenko I., Compressing Embedded GNU/Linux and Windows 10 IoT Images Using XZ Utilities, 1st Scientific and Practical Conference “Software Engineering and Information Organization”, SEIM-2016, St Petersburg-Rusya, 41, 22 Nisan 2016.
  • Kirby G., Zipf’s law. UK Journal of Naval Science, 10, 180–5, 1985.
  • Ferragina P, Navarro G., Pizza&Chili Corpus—Compressed indexes and their testbeds. September, 2005.
  • Mahoney M., Large text compression benchmark, http://www.mattmahoney.net/dc/text.html, Yayın tarihi Eylül 15, 2022. Erişim tarihi Mayıs 19, 2022.
Toplam 43 adet kaynakça vardır.

Ayrıntılar

Birincil Dil Türkçe
Konular Mühendislik
Bölüm Makaleler
Yazarlar

Emir Öztürk 0000-0002-3734-5171

Altan Mesut 0000-0002-1477-3093

Proje Numarası -
Erken Görünüm Tarihi 16 Mayıs 2024
Yayımlanma Tarihi 20 Mayıs 2024
Gönderilme Tarihi 5 Kasım 2022
Kabul Tarihi 27 Kasım 2023
Yayımlandığı Sayı Yıl 2024

Kaynak Göster

APA Öztürk, E., & Mesut, A. (2024). CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritması. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, 39(3), 1933-1944. https://doi.org/10.17341/gazimmfd.1199811
AMA Öztürk E, Mesut A. CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritması. GUMMFD. Mayıs 2024;39(3):1933-1944. doi:10.17341/gazimmfd.1199811
Chicago Öztürk, Emir, ve Altan Mesut. “CComp: Sıkıştırılmış Kelime Arama için Paralel Bir sıkıştırma Algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 39, sy. 3 (Mayıs 2024): 1933-44. https://doi.org/10.17341/gazimmfd.1199811.
EndNote Öztürk E, Mesut A (01 Mayıs 2024) CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritması. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 39 3 1933–1944.
IEEE E. Öztürk ve A. Mesut, “CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritması”, GUMMFD, c. 39, sy. 3, ss. 1933–1944, 2024, doi: 10.17341/gazimmfd.1199811.
ISNAD Öztürk, Emir - Mesut, Altan. “CComp: Sıkıştırılmış Kelime Arama için Paralel Bir sıkıştırma Algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 39/3 (Mayıs 2024), 1933-1944. https://doi.org/10.17341/gazimmfd.1199811.
JAMA Öztürk E, Mesut A. CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritması. GUMMFD. 2024;39:1933–1944.
MLA Öztürk, Emir ve Altan Mesut. “CComp: Sıkıştırılmış Kelime Arama için Paralel Bir sıkıştırma Algoritması”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, c. 39, sy. 3, 2024, ss. 1933-44, doi:10.17341/gazimmfd.1199811.
Vancouver Öztürk E, Mesut A. CComp: Sıkıştırılmış kelime arama için paralel bir sıkıştırma algoritması. GUMMFD. 2024;39(3):1933-44.