Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması

Can Özbey; Murat Cihan Sorkun

Araştırma Makalesi

Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması

Yıl 2018, Cilt: 11 Sayı: 2, 30 - 40, 15.11.2018

Can Özbey , Murat Cihan Sorkun

Öz

Kayıpsız veri sıkıştırma, özellikle
bellek içi veri tabanları ve önbellek kullanımlı bilgi geri kazanım sistemlerinde,
harcanan disk alanını azaltmasının yanı sıra, etkin kod çözme algoritmaları
aracılığıyla bilgiye erişimi hızlandırması sebebiyle önem arz etmektedir. Bu
çalışmada, ters dizinlerin sıkıştırılması kapsamında yeni bir değişken sekiz
ikili kodlama yöntemi geliştirilmiş ve terim-doküman matrisinin bant
genişliğinin indirgenmesi amacıyla tepe tırmanmaya dayalı çift kutuplu dizilim şeması
önerilmiştir. Bu şema, internet üzerinden toplanan haber metinlerine
uygulanarak doküman dizmenin dizin sıkıştırma oranına olan etkisi incelenmiştir.

Anahtar Kelimeler

Terim-Doküman Matrisi, Kayıpsız Veri Sıkıştırma, Ters Dizin Sıkıştırma, Değişken Sekiz İkili Kodlama, Özyinelemeli Sekiz İkili Kodlama, Doküman Dizme, Matris Bant Genişliği İndirgeme, Çift Kutuplu Sıralama, Tepe Tırmanma

Kaynakça

[1] G.Graefe and L.Shapiro. Data compression and database performance. In ACM/IEEE-CS Symp. On Applied Computing pages 22 -27, April 1991.
[2] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, pages 553–564, 2005.
[3] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, 2006.
[4] C. Lin, J. Wang, and Y. Papakonstantinou. Data Compression for Analytics over Large-scale In-memory Column Databases, ACM 2016.
[5] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Vol. 39. Cambridge University Press, 2008.
[6] I. H. Witten, A. Moffat, T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA, USA, 1999.
[7] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000.
[8] M. A. Martinez-Prieto, N. Brisaboa, R. Canovas, F. Claude, and G. Navarro. Practical compressed string dictionaries. Information Systems, 56:73–108, 2016.
[9] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[10] H. E. Williams,J. Zobel. Compressing Integersfor Fast File Access. Comput.J.42, pp. 193-201, 1999.
[11] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194–203, 1975.
[12] V. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. of the 15th Int. Australasian Database Conference, pages 61–67, 2004.
[13] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40.9, pages 1098-1101, 1952.
[14] I. H. Witten, M. N. Radford, and G. C. John. Arithmetic coding for data compression. Communications of the ACM 30.6, pages 520-540, 1987. [15] A. S. Fraenkel and S. T. Klein. Novel compression of sparse bit-strings. Preliminary report. Combinatorial Algorithms on Words, Volume 12. NATO ASI Series F, pages 169-183, 1985.
[16] S. W. Golomb. Run-length encoding. IEEE Transactions on Information Theory, vol. IT-12, pp. 399-401, 1966.
[17] R. Gallager, and D. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information theory 21.2,pages: 228-230, 1975.
[18] A. Trotman. Compressing inverted files. Information Retrieval, 6.1: 5-19, 2003.
[19] F. Scholer, H.E. Williams, J. Yiannis, J. Zobel. Compressionof Inverted Indexes for Fast Query Evaluation. In SIGIR 2002, pp. 222-229, 2002.
[20] J. Plaisance, N. Kurz, and D. Lemire. Vectorized vbyte decoding. CoRR, 2015.
[21] M. C. Sorkunand C. Özbey. Compression experiments on term-document index. Computer Science and Engineering (UBMK), 2017 International Conference on. IEEE, 2017.
[22] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international conference on World wide web. ACM, 401–410, 2009.
[23] C. Papadimitriou. The NP-completeness of the bandwidth minimization problem.Computing, vol. 16, pp. 263–270, 1976.
[24] A. Lim, B. Rodrigues, F. Xiao. Integrated genetic algorithm with hill climbing forbandwidth minimization problem. GECCO 2003. LNCS, vol. 2724, pp. 1594–1595, 2003.
[25] E. Rodriguez-Tello, J. K. Hao, J. Torres-Jimenez. An improved simulated annealing algorithm for bandwidth minimization, European Journal of Operational Research 185, pp. 1319–1335, 2008.
[26] R. Martí, M. Laguna, F. Glover and V. Campos.Reducing the bandwidth ofa sparse matrix with tabu search. European Journal of Operational Research 135, pp. 450–459, 2001.
[27] E. Cuthill, J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference (New York, NY, USA, 1969), ACM ’69, ACM, pp. 157–172, 1969.
[28] N. E. Gibbs, W. G. Poole and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis 13, pp. 236-250, 1976.
[29] S. W. Sloan. An algorithm for profile and wavefront reduction of sparse matrices. International Journal for Numerical Methods in Engineering 23, 2. 239–251, 1986.
[30] D. Harel and Y. Koren. Graph drawing by high-dimensionalembedding. In Revised Papers from the 10th International Symposium on Graph Drawing. GD ’02, Springer-Verlag, pp. 207–219, 2002.
[31] I. Spence andJ. Graef. The determination of the underlying dimensionalityof an empirically obtained matrix of proximities. Multivariate Behavioral Research 9, pp. 331–341, 1974.
[32] Y. Cheng, G. M. Church. Biclustering of expression data. In Ismb, vol. 8, pp. 93–103, 2000.
[33] F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of 29th European Conf. on Information Retrieval, pages 101–112, 2007.
[34] V. Ramaswamy, R. Konow, A. Trotman, J. Degenhardt, and N. Whyte. Document Reordering is Good, Especially for e-Commerce. In Proceedings of the SIGIR 2017 Workshop on eCommerce (ECOM 17), 2017.
[35] S. Büttcher, C. Clarke, and G. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, p. 214, 2010.
[36] Y. Nekritch. Decoding of canonical Huffman codes with look-up tables. In Proc. Conf. Data Compression, p. 566, 2000.

Yıl 2018, Cilt: 11 Sayı: 2, 30 - 40, 15.11.2018

Can Özbey , Murat Cihan Sorkun

Öz

Kaynakça

[1] G.Graefe and L.Shapiro. Data compression and database performance. In ACM/IEEE-CS Symp. On Applied Computing pages 22 -27, April 1991.
[2] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, pages 553–564, 2005.
[3] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, 2006.
[4] C. Lin, J. Wang, and Y. Papakonstantinou. Data Compression for Analytics over Large-scale In-memory Column Databases, ACM 2016.
[5] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Vol. 39. Cambridge University Press, 2008.
[6] I. H. Witten, A. Moffat, T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA, USA, 1999.
[7] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000.
[8] M. A. Martinez-Prieto, N. Brisaboa, R. Canovas, F. Claude, and G. Navarro. Practical compressed string dictionaries. Information Systems, 56:73–108, 2016.
[9] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[10] H. E. Williams,J. Zobel. Compressing Integersfor Fast File Access. Comput.J.42, pp. 193-201, 1999.
[11] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194–203, 1975.
[12] V. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. of the 15th Int. Australasian Database Conference, pages 61–67, 2004.
[13] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40.9, pages 1098-1101, 1952.
[14] I. H. Witten, M. N. Radford, and G. C. John. Arithmetic coding for data compression. Communications of the ACM 30.6, pages 520-540, 1987. [15] A. S. Fraenkel and S. T. Klein. Novel compression of sparse bit-strings. Preliminary report. Combinatorial Algorithms on Words, Volume 12. NATO ASI Series F, pages 169-183, 1985.
[16] S. W. Golomb. Run-length encoding. IEEE Transactions on Information Theory, vol. IT-12, pp. 399-401, 1966.
[17] R. Gallager, and D. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information theory 21.2,pages: 228-230, 1975.
[18] A. Trotman. Compressing inverted files. Information Retrieval, 6.1: 5-19, 2003.
[19] F. Scholer, H.E. Williams, J. Yiannis, J. Zobel. Compressionof Inverted Indexes for Fast Query Evaluation. In SIGIR 2002, pp. 222-229, 2002.
[20] J. Plaisance, N. Kurz, and D. Lemire. Vectorized vbyte decoding. CoRR, 2015.
[21] M. C. Sorkunand C. Özbey. Compression experiments on term-document index. Computer Science and Engineering (UBMK), 2017 International Conference on. IEEE, 2017.
[22] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international conference on World wide web. ACM, 401–410, 2009.
[23] C. Papadimitriou. The NP-completeness of the bandwidth minimization problem.Computing, vol. 16, pp. 263–270, 1976.
[24] A. Lim, B. Rodrigues, F. Xiao. Integrated genetic algorithm with hill climbing forbandwidth minimization problem. GECCO 2003. LNCS, vol. 2724, pp. 1594–1595, 2003.
[25] E. Rodriguez-Tello, J. K. Hao, J. Torres-Jimenez. An improved simulated annealing algorithm for bandwidth minimization, European Journal of Operational Research 185, pp. 1319–1335, 2008.
[26] R. Martí, M. Laguna, F. Glover and V. Campos.Reducing the bandwidth ofa sparse matrix with tabu search. European Journal of Operational Research 135, pp. 450–459, 2001.
[27] E. Cuthill, J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference (New York, NY, USA, 1969), ACM ’69, ACM, pp. 157–172, 1969.
[28] N. E. Gibbs, W. G. Poole and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis 13, pp. 236-250, 1976.
[29] S. W. Sloan. An algorithm for profile and wavefront reduction of sparse matrices. International Journal for Numerical Methods in Engineering 23, 2. 239–251, 1986.
[30] D. Harel and Y. Koren. Graph drawing by high-dimensionalembedding. In Revised Papers from the 10th International Symposium on Graph Drawing. GD ’02, Springer-Verlag, pp. 207–219, 2002.
[31] I. Spence andJ. Graef. The determination of the underlying dimensionalityof an empirically obtained matrix of proximities. Multivariate Behavioral Research 9, pp. 331–341, 1974.
[32] Y. Cheng, G. M. Church. Biclustering of expression data. In Ismb, vol. 8, pp. 93–103, 2000.
[33] F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of 29th European Conf. on Information Retrieval, pages 101–112, 2007.
[34] V. Ramaswamy, R. Konow, A. Trotman, J. Degenhardt, and N. Whyte. Document Reordering is Good, Especially for e-Commerce. In Proceedings of the SIGIR 2017 Workshop on eCommerce (ECOM 17), 2017.
[35] S. Büttcher, C. Clarke, and G. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, p. 214, 2010.
[36] Y. Nekritch. Decoding of canonical Huffman codes with look-up tables. In Proc. Conf. Data Compression, p. 566, 2000.

Toplam 35 adet kaynakça vardır.

Ayrıntılar

Birincil Dil	Türkçe
Konular	Mühendislik
Bölüm	Makaleler(Araştırma)
Yazarlar	Can Özbey Murat Cihan Sorkun Bu kişi benim
Yayımlanma Tarihi	15 Kasım 2018
Yayımlandığı Sayı	Yıl 2018 Cilt: 11 Sayı: 2

Kaynak Göster

APA	Özbey, C., & Sorkun, M. C. (2018). Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, 11(2), 30-40.
AMA	Özbey C, Sorkun MC. Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. TBV-BBMD. Kasım 2018;11(2):30-40.
Chicago	Özbey, Can, ve Murat Cihan Sorkun. “Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi 11, sy. 2 (Kasım 2018): 30-40.
EndNote	Özbey C, Sorkun MC (01 Kasım 2018) Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 11 2 30–40.
IEEE	C. Özbey ve M. C. Sorkun, “Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması”, TBV-BBMD, c. 11, sy. 2, ss. 30–40, 2018.
ISNAD	Özbey, Can - Sorkun, Murat Cihan. “Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 11/2 (Kasım 2018), 30-40.
JAMA	Özbey C, Sorkun MC. Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. TBV-BBMD. 2018;11:30–40.
MLA	Özbey, Can ve Murat Cihan Sorkun. “Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, c. 11, sy. 2, 2018, ss. 30-40.
Vancouver	Özbey C, Sorkun MC. Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. TBV-BBMD. 2018;11(2):30-4.

Kapak Resmi İndir

Makale Dosyaları

Tam Metin

https://i.creativecommons.org/l/by-nc/4.0Makale Kabulü

Çevrimiçi makale yüklemesi yapmak için kullanıcı kayıt/girişini kullanınız.

Dergiye gönderilen makalelerin kabul süreci şu aşamalardan oluşmaktadır:

1. Gönderilen her makale ilk aşamada en az iki hakeme gönderilmektedir.

2. Hakem ataması, dergi editörleri tarafından yapılmaktadır. Derginin hakem havuzunda yaklaşık 200 hakem bulunmaktadır ve bu hakemler ilgi alanlarına göre sınıflandırılmıştır. Her hakeme ilgilendiği konuda makale gönderilmektedir. Hakem seçimi menfaat çatışmasına neden olmayacak biçimde yapılmaktadır.

3. Hakemlere gönderilen makalelerde yazar adları kapatılmaktadır.

4. Hakemlere bir makalenin nasıl değerlendirileceği açıklanmaktadır ve aşağıda görülen değerlendirme formunu doldurmaları istenmektedir.

5. İki hakemin olumlu görüş bildirdiği makaleler editörler tarafından benzerlik incelemesinden geçirilir. Makalelerdeki benzerliğin %25’ten küçük olması beklenir.

6. Tüm aşamaları geçmiş olan bir bildiri dil ve sunuş açısından editör tarafından incelenir ve gerekli düzeltme ve iyileştirmeler yapılır. Gerekirse yazarlara durum bildirilir.

Bu eser Creative Commons Atıf-GayriTicari 4.0 Uluslararası Lisansı ile lisanslanmıştır.