Research Article
BibTex RIS Cite

Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması

Year 2018, Volume: 11 Issue: 2, 30 - 40, 15.11.2018

Abstract

Kayıpsız veri sıkıştırma, özellikle
bellek içi veri tabanları ve önbellek kullanımlı bilgi geri kazanım sistemlerinde,
harcanan disk alanını azaltmasının yanı sıra, etkin kod çözme algoritmaları
aracılığıyla bilgiye erişimi hızlandırması sebebiyle önem arz etmektedir. Bu
çalışmada, ters dizinlerin sıkıştırılması kapsamında yeni bir değişken sekiz
ikili kodlama yöntemi geliştirilmiş ve terim-doküman matrisinin bant
genişliğinin indirgenmesi amacıyla tepe tırmanmaya dayalı çift kutuplu dizilim şeması
önerilmiştir. Bu şema, internet üzerinden toplanan haber metinlerine
uygulanarak doküman dizmenin dizin sıkıştırma oranına olan etkisi incelenmiştir.

References

  • [1] G.Graefe and L.Shapiro. Data compression and database performance. In ACM/IEEE-CS Symp. On Applied Computing pages 22 -27, April 1991.
  • [2] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, pages 553–564, 2005.
  • [3] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, 2006.
  • [4] C. Lin, J. Wang, and Y. Papakonstantinou. Data Compression for Analytics over Large-scale In-memory Column Databases, ACM 2016.
  • [5] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Vol. 39. Cambridge University Press, 2008.
  • [6] I. H. Witten, A. Moffat, T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA, USA, 1999.
  • [7] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000.
  • [8] M. A. Martinez-Prieto, N. Brisaboa, R. Canovas, F. Claude, and G. Navarro. Practical compressed string dictionaries. Information Systems, 56:73–108, 2016.
  • [9] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
  • [10] H. E. Williams,J. Zobel. Compressing Integersfor Fast File Access. Comput.J.42, pp. 193-201, 1999.
  • [11] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194–203, 1975.
  • [12] V. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. of the 15th Int. Australasian Database Conference, pages 61–67, 2004.
  • [13] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40.9, pages 1098-1101, 1952.
  • [14] I. H. Witten, M. N. Radford, and G. C. John. Arithmetic coding for data compression. Communications of the ACM 30.6, pages 520-540, 1987. [15] A. S. Fraenkel and S. T. Klein. Novel compression of sparse bit-strings. Preliminary report. Combinatorial Algorithms on Words, Volume 12. NATO ASI Series F, pages 169-183, 1985.
  • [16] S. W. Golomb. Run-length encoding. IEEE Transactions on Information Theory, vol. IT-12, pp. 399-401, 1966.
  • [17] R. Gallager, and D. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information theory 21.2,pages: 228-230, 1975.
  • [18] A. Trotman. Compressing inverted files. Information Retrieval, 6.1: 5-19, 2003.
  • [19] F. Scholer, H.E. Williams, J. Yiannis, J. Zobel. Compressionof Inverted Indexes for Fast Query Evaluation. In SIGIR 2002, pp. 222-229, 2002.
  • [20] J. Plaisance, N. Kurz, and D. Lemire. Vectorized vbyte decoding. CoRR, 2015.
  • [21] M. C. Sorkunand C. Özbey. Compression experiments on term-document index. Computer Science and Engineering (UBMK), 2017 International Conference on. IEEE, 2017.
  • [22] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international conference on World wide web. ACM, 401–410, 2009.
  • [23] C. Papadimitriou. The NP-completeness of the bandwidth minimization problem.Computing, vol. 16, pp. 263–270, 1976.
  • [24] A. Lim, B. Rodrigues, F. Xiao. Integrated genetic algorithm with hill climbing forbandwidth minimization problem. GECCO 2003. LNCS, vol. 2724, pp. 1594–1595, 2003.
  • [25] E. Rodriguez-Tello, J. K. Hao, J. Torres-Jimenez. An improved simulated annealing algorithm for bandwidth minimization, European Journal of Operational Research 185, pp. 1319–1335, 2008.
  • [26] R. Martí, M. Laguna, F. Glover and V. Campos.Reducing the bandwidth ofa sparse matrix with tabu search. European Journal of Operational Research 135, pp. 450–459, 2001.
  • [27] E. Cuthill, J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference (New York, NY, USA, 1969), ACM ’69, ACM, pp. 157–172, 1969.
  • [28] N. E. Gibbs, W. G. Poole and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis 13, pp. 236-250, 1976.
  • [29] S. W. Sloan. An algorithm for profile and wavefront reduction of sparse matrices. International Journal for Numerical Methods in Engineering 23, 2. 239–251, 1986.
  • [30] D. Harel and Y. Koren. Graph drawing by high-dimensionalembedding. In Revised Papers from the 10th International Symposium on Graph Drawing. GD ’02, Springer-Verlag, pp. 207–219, 2002.
  • [31] I. Spence andJ. Graef. The determination of the underlying dimensionalityof an empirically obtained matrix of proximities. Multivariate Behavioral Research 9, pp. 331–341, 1974.
  • [32] Y. Cheng, G. M. Church. Biclustering of expression data. In Ismb, vol. 8, pp. 93–103, 2000.
  • [33] F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of 29th European Conf. on Information Retrieval, pages 101–112, 2007.
  • [34] V. Ramaswamy, R. Konow, A. Trotman, J. Degenhardt, and N. Whyte. Document Reordering is Good, Especially for e-Commerce. In Proceedings of the SIGIR 2017 Workshop on eCommerce (ECOM 17), 2017.
  • [35] S. Büttcher, C. Clarke, and G. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, p. 214, 2010.
  • [36] Y. Nekritch. Decoding of canonical Huffman codes with look-up tables. In Proc. Conf. Data Compression, p. 566, 2000.
Year 2018, Volume: 11 Issue: 2, 30 - 40, 15.11.2018

Abstract

References

  • [1] G.Graefe and L.Shapiro. Data compression and database performance. In ACM/IEEE-CS Symp. On Applied Computing pages 22 -27, April 1991.
  • [2] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, pages 553–564, 2005.
  • [3] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, 2006.
  • [4] C. Lin, J. Wang, and Y. Papakonstantinou. Data Compression for Analytics over Large-scale In-memory Column Databases, ACM 2016.
  • [5] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Vol. 39. Cambridge University Press, 2008.
  • [6] I. H. Witten, A. Moffat, T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA, USA, 1999.
  • [7] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000.
  • [8] M. A. Martinez-Prieto, N. Brisaboa, R. Canovas, F. Claude, and G. Navarro. Practical compressed string dictionaries. Information Systems, 56:73–108, 2016.
  • [9] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
  • [10] H. E. Williams,J. Zobel. Compressing Integersfor Fast File Access. Comput.J.42, pp. 193-201, 1999.
  • [11] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194–203, 1975.
  • [12] V. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. of the 15th Int. Australasian Database Conference, pages 61–67, 2004.
  • [13] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40.9, pages 1098-1101, 1952.
  • [14] I. H. Witten, M. N. Radford, and G. C. John. Arithmetic coding for data compression. Communications of the ACM 30.6, pages 520-540, 1987. [15] A. S. Fraenkel and S. T. Klein. Novel compression of sparse bit-strings. Preliminary report. Combinatorial Algorithms on Words, Volume 12. NATO ASI Series F, pages 169-183, 1985.
  • [16] S. W. Golomb. Run-length encoding. IEEE Transactions on Information Theory, vol. IT-12, pp. 399-401, 1966.
  • [17] R. Gallager, and D. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information theory 21.2,pages: 228-230, 1975.
  • [18] A. Trotman. Compressing inverted files. Information Retrieval, 6.1: 5-19, 2003.
  • [19] F. Scholer, H.E. Williams, J. Yiannis, J. Zobel. Compressionof Inverted Indexes for Fast Query Evaluation. In SIGIR 2002, pp. 222-229, 2002.
  • [20] J. Plaisance, N. Kurz, and D. Lemire. Vectorized vbyte decoding. CoRR, 2015.
  • [21] M. C. Sorkunand C. Özbey. Compression experiments on term-document index. Computer Science and Engineering (UBMK), 2017 International Conference on. IEEE, 2017.
  • [22] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international conference on World wide web. ACM, 401–410, 2009.
  • [23] C. Papadimitriou. The NP-completeness of the bandwidth minimization problem.Computing, vol. 16, pp. 263–270, 1976.
  • [24] A. Lim, B. Rodrigues, F. Xiao. Integrated genetic algorithm with hill climbing forbandwidth minimization problem. GECCO 2003. LNCS, vol. 2724, pp. 1594–1595, 2003.
  • [25] E. Rodriguez-Tello, J. K. Hao, J. Torres-Jimenez. An improved simulated annealing algorithm for bandwidth minimization, European Journal of Operational Research 185, pp. 1319–1335, 2008.
  • [26] R. Martí, M. Laguna, F. Glover and V. Campos.Reducing the bandwidth ofa sparse matrix with tabu search. European Journal of Operational Research 135, pp. 450–459, 2001.
  • [27] E. Cuthill, J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference (New York, NY, USA, 1969), ACM ’69, ACM, pp. 157–172, 1969.
  • [28] N. E. Gibbs, W. G. Poole and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis 13, pp. 236-250, 1976.
  • [29] S. W. Sloan. An algorithm for profile and wavefront reduction of sparse matrices. International Journal for Numerical Methods in Engineering 23, 2. 239–251, 1986.
  • [30] D. Harel and Y. Koren. Graph drawing by high-dimensionalembedding. In Revised Papers from the 10th International Symposium on Graph Drawing. GD ’02, Springer-Verlag, pp. 207–219, 2002.
  • [31] I. Spence andJ. Graef. The determination of the underlying dimensionalityof an empirically obtained matrix of proximities. Multivariate Behavioral Research 9, pp. 331–341, 1974.
  • [32] Y. Cheng, G. M. Church. Biclustering of expression data. In Ismb, vol. 8, pp. 93–103, 2000.
  • [33] F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of 29th European Conf. on Information Retrieval, pages 101–112, 2007.
  • [34] V. Ramaswamy, R. Konow, A. Trotman, J. Degenhardt, and N. Whyte. Document Reordering is Good, Especially for e-Commerce. In Proceedings of the SIGIR 2017 Workshop on eCommerce (ECOM 17), 2017.
  • [35] S. Büttcher, C. Clarke, and G. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, p. 214, 2010.
  • [36] Y. Nekritch. Decoding of canonical Huffman codes with look-up tables. In Proc. Conf. Data Compression, p. 566, 2000.
There are 35 citations in total.

Details

Primary Language Turkish
Subjects Engineering
Journal Section Makaleler(Araştırma)
Authors

Can Özbey

Murat Cihan Sorkun This is me

Publication Date November 15, 2018
Published in Issue Year 2018 Volume: 11 Issue: 2

Cite

APA Özbey, C., & Sorkun, M. C. (2018). Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, 11(2), 30-40.
AMA Özbey C, Sorkun MC. Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. TBV-BBMD. November 2018;11(2):30-40.
Chicago Özbey, Can, and Murat Cihan Sorkun. “Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi 11, no. 2 (November 2018): 30-40.
EndNote Özbey C, Sorkun MC (November 1, 2018) Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 11 2 30–40.
IEEE C. Özbey and M. C. Sorkun, “Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması”, TBV-BBMD, vol. 11, no. 2, pp. 30–40, 2018.
ISNAD Özbey, Can - Sorkun, Murat Cihan. “Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 11/2 (November 2018), 30-40.
JAMA Özbey C, Sorkun MC. Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. TBV-BBMD. 2018;11:30–40.
MLA Özbey, Can and Murat Cihan Sorkun. “Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, vol. 11, no. 2, 2018, pp. 30-40.
Vancouver Özbey C, Sorkun MC. Terim-Doküman Matrisleri için Sıralamaya Dayalı Bir Kayıpsız Sıkıştırma Şeması. TBV-BBMD. 2018;11(2):30-4.

Article Acceptance

Use user registration/login to upload articles online.

The acceptance process of the articles sent to the journal consists of the following stages:

1. Each submitted article is sent to at least two referees at the first stage.

2. Referee appointments are made by the journal editors. There are approximately 200 referees in the referee pool of the journal and these referees are classified according to their areas of interest. Each referee is sent an article on the subject he is interested in. The selection of the arbitrator is done in a way that does not cause any conflict of interest.

3. In the articles sent to the referees, the names of the authors are closed.

4. Referees are explained how to evaluate an article and are asked to fill in the evaluation form shown below.

5. The articles in which two referees give positive opinion are subjected to similarity review by the editors. The similarity in the articles is expected to be less than 25%.

6. A paper that has passed all stages is reviewed by the editor in terms of language and presentation, and necessary corrections and improvements are made. If necessary, the authors are notified of the situation.

0

.   This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.