Research Article

A BERT-Based Method for Compressing Short Texts

Year 2021, Issue 32, 177 - 182, 31.12.2021
https://doi.org/10.31590/ejosat.1039450

Abstract

Using data compression algorithms for data transmission and storage reduces transfer time and storage cost. Natural-language text is one of the most widely produced data types, and several methods exist for compressing it. Many traditional methods, however, perform poorly on short texts, which call for approaches different from general-purpose compression. This study proposes a short-text compression algorithm that uses the prediction mechanism of BERT and compares it with traditional methods. The performance of the proposed method is also examined and compared across different parameters and models. The proposed method achieves compression ratios up to 39% better than well-known algorithms such as Gzip, Bzip2 and Zstd.
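
To illustrate why general-purpose compressors tend to struggle on short inputs, the following minimal Python sketch measures the output size of three standard-library codecs on one short sentence and on a longer, repetitive text. It is not the method proposed in the article. The sample sentence and the helper name `compressed_sizes` are assumptions made for illustration, and Zstd is left out because the sketch relies only on modules assumed to be present in any Python 3.9+ installation.

```python
import bz2
import gzip
import zlib


def compressed_sizes(text: str) -> dict[str, int]:
    """Return the UTF-8 size of `text` and its size under three stdlib codecs."""
    raw = text.encode("utf-8")
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),
        "gzip": len(gzip.compress(raw, compresslevel=9)),
        "bz2": len(bz2.compress(raw, compresslevel=9)),
    }


if __name__ == "__main__":
    # Hypothetical sample inputs, not taken from the article's data sets.
    short_text = "Short texts are hard to compress."
    long_text = short_text * 200

    for label, text in (("short", short_text), ("long", long_text)):
        print(label, compressed_sizes(text))
```

For a sentence of a few dozen bytes, fixed container overhead such as headers and checksums makes each compressed output larger than the input, whereas the longer, repetitive text shrinks considerably. This is the gap that short-text compressors aim to close.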

References

  • Aslanyürek, M., & Mesut, A. (2021). A Static Dictionary-Based Approach To Compressing Short Texts. 2021 6th International Conference on Computer Science and Engineering (UBMK), 342–347.
  • Collet, Y., & Kucherawy, M. (2018). Zstandard Compression and the application/zstd Media Type. RFC 8478.
  • Deutsch, P. (1996). DEFLATE compressed data format specification version 1.3. https://www.rfc-editor.org/info/rfc1951
  • Deutsch, P. (1996). GZIP file format specification version 4.3. https://www.rfc-editor.org/info/rfc1952
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805. http://arxiv.org/abs/1810.04805
  • Duda, J., Tahboub, K., Gadgil, N. J., & Delp, E. J. (2015). The use of asymmetric numeral systems as an accurate replacement for Huffman coding. 2015 Picture Coding Symposium, PCS 2015 - with 2015 Packet Video Workshop, PV 2015 - Proceedings. https://doi.org/10.1109/PCS.2015.7170048
  • Gardner-Stephen, P., Bettison, A., Challans, R., Hampton, J., Lakeman, J., & Wallis, C. (2013). Improving Compression of Short Messages. International Journal of Communications, Network and System Sciences, 06(12), 497–504. https://doi.org/10.4236/ijcns.2013.612053
  • Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9), 1098–1101.
  • Manzini, G. (2001). An analysis of the Burrows-Wheeler transform. Journal of the ACM (JACM), 48(3), 407–430.
  • Mathews, G. J. (1995). Selecting a general-purpose data compression algorithm. Proceedings of the Science Information Management and Data Compression Workshop, 55–64.
  • Nguyen, V. H., Nguyen, H. T., Duong, H. N., & Snasel, V. (2016). n-Gram-Based Text Compression. 2016.
  • Öztürk, E., Mesut, A., & Diri, B. (2017). Multi-stream word-based compression algorithm. 2nd International Conference on Computer Science and Engineering, UBMK 2017. https://doi.org/10.1109/UBMK.2017.8093552
  • Öztürk, E., Mesut, A., & Diri, B. (2018). Multi-Stream Word-Based Compression Algorithm for Compressed Text Search. Arabian Journal for Science and Engineering, 43(12), 8209–8221. https://doi.org/10.1007/s13369-018-3378-9
  • Platoš, J., Snášel, V., & El-Qawasmeh, E. (2008). Compression of small text files. Advanced Engineering Informatics, 22(3), 410–417. https://doi.org/10.1016/j.aei.2008.05.001
  • Sabancı Üniversitesi Veri Analitiği Araştırma ve Uygulama Merkezi. (2018). SuDer Corpus - Turkish News Collections for Text Categorization. https://github.com/suverim/suder
  • Sanfilippo, S. (2009). SMAZ: Compression for Very Small Strings. https://github.com/antirez/smaz
  • Say, B., Zeyrek, D., Oflazer, K., & Özge, U. (2002). Development of a corpus and a treebank for present-day written Turkish. In Proceedings of the eleventh international conference of Turkish linguistics (pp. 183-192).
  • Schramm, C. (2013). Shoco: a fast compressor for short strings. https://ed-von-schleck.github.io/shoco/
  • Seward, J. (1996). bzip2 and libbzip2, version 1.0.8. https://www.sourceware.org/bzip2/manual/manual.pdf
  • Storer, J. A., & Szymanski, T. G. (1982). Data Compression via Textual Substitution. J. ACM, 29(4), 928–951. https://doi.org/10.1145/322344.322346
  • Ziviani, N., De Moura, E. S., Navarro, G., & Baeza-Yates, R. (2000). Compression: A key for next-generation text retrieval systems. Computer, 33(11), 37–44.

There are 19 citations in total.

Details

Primary Language: Turkish
Subjects: Engineering
Journal Section: Articles
Authors

Emir Öztürk 0000-0002-3734-5171

Altan Mesut 0000-0002-1477-3093

Publication Date: December 31, 2021
Published in Issue: Year 2021

Cite

APA: Öztürk, E., & Mesut, A. (2021). Kısa Metinlerin Sıkıştırılması için BERT Tabanlı bir Yöntem. Avrupa Bilim ve Teknoloji Dergisi, (32), 177-182. https://doi.org/10.31590/ejosat.1039450