Equi-Depth Histogram Construction Methodology for Big Data Tools

Tolga Büyüktanır; Ahmet Ercan Topcu

doi:10.2339/politeknik.620198

Abstract

In recent decades, countless data sources such as social media, machines, and networks have been constantly pushing data into the digital world, and the volume of data has been growing exponentially. Equi-depth histograms are essential for summarizing the statistical distribution of data, which query optimization relies on. In this paper, we present approximate equi-depth histogram construction for big data using both Apache Pig scripts and a Java web interface that interacts with Apache Hadoop. We adopt an equi-depth histogram construction approach with quality guarantees for big data and implement it with Apache Hadoop MapReduce and Apache Pig user-defined functions. We introduce a prototype that constructs the approximate equi-depth histogram from a JavaServer Faces page using Apache Hadoop jobs and the Hadoop Distributed File System, and we evaluate these methods with this demonstration. We also explain Apache Pig scripting techniques for creating equi-depth histograms over big data. The results indicate that our system supports writing multiple jobs in Apache Pig, so programmers can use Apache Pig to create histograms and avoid the complex implementation of MapReduce jobs.
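For readers unfamiliar with the term, an equi-depth histogram chooses bucket boundaries so that every bucket holds roughly the same number of values. The following is a minimal single-machine sketch in Python, written only to illustrate that definition. It is not the approximate, distributed construction described in the paper, and the function names `equi_depth_boundaries` and `bucket_counts` are hypothetical.

```python
def equi_depth_boundaries(values, num_buckets):
    """Return the upper boundary of each bucket so buckets hold near-equal counts.

    Illustrative only: sorts everything in memory, which a big data
    system would avoid.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    boundaries = []
    for i in range(1, num_buckets + 1):
        # Index of the last element of bucket i: ceil(i * n / num_buckets) - 1.
        last = (i * n + num_buckets - 1) // num_buckets - 1
        boundaries.append(ordered[last])
    return boundaries


def bucket_counts(values, boundaries):
    """Count the values falling into each bucket (previous boundary, boundary]."""
    counts = [0] * len(boundaries)
    for v in values:
        for idx, upper in enumerate(boundaries):
            if v <= upper:
                counts[idx] += 1
                break
    return counts


sample = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10, 12, 11]
bounds = equi_depth_boundaries(sample, 4)
print(bounds)                         # [3, 6, 9, 12]
print(bucket_counts(sample, bounds))  # [3, 3, 3, 3]
```

With distinct values, each of the four buckets receives three items. When the data contains many duplicates, several boundaries can coincide and the counts become uneven, which is one reason practical systems settle for approximately equal depths.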

Keywords

approximate histogram, merging histograms, big data, log files, Hadoop Distributed File System

Details

Primary Language

English

Subjects

Engineering

Section

Research Article

Authors

Tolga Büyüktanır *
0000-0001-5317-0028
Türkiye

Ahmet Ercan Topcu
0000-0003-1929-5358
Türkiye

Publication Date

September 1, 2020

Submission Date

September 13, 2019

Acceptance Date

April 1, 2020

Published in Issue

Year 2020 Volume: 23 Issue: 3

DOI

https://doi.org/10.2339/politeknik.620198

IZ

https://izlik.org/JA42MX99BR

Cite

APA

Büyüktanır, T., & Topcu, A. E. (2020). Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi, 23(3), 859-865. https://doi.org/10.2339/politeknik.620198

AMA

1. Büyüktanır T, Topcu AE. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. 2020;23(3):859-865. doi:10.2339/politeknik.620198

Chicago

Büyüktanır, Tolga, and Ahmet Ercan Topcu. 2020. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi 23 (3): 859-65. https://doi.org/10.2339/politeknik.620198.

EndNote

Büyüktanır T, Topcu AE (September 1, 2020) Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi 23 3 859–865.

IEEE

[1] T. Büyüktanır and A. E. Topcu, “Equi-Depth Histogram Construction Methodology for Big Data Tools”, Politeknik Dergisi, vol. 23, no. 3, pp. 859–865, Sep. 2020, doi: 10.2339/politeknik.620198.

ISNAD

Büyüktanır, Tolga - Topcu, Ahmet Ercan. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi 23/3 (September 1, 2020): 859-865. https://doi.org/10.2339/politeknik.620198.

JAMA

1. Büyüktanır T, Topcu AE. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. 2020;23:859–865.

MLA

Büyüktanır, Tolga, and Ahmet Ercan Topcu. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi, vol. 23, no. 3, September 2020, pp. 859-65, doi:10.2339/politeknik.620198.

Vancouver

1. Tolga Büyüktanır, Ahmet Ercan Topcu. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. September 1, 2020;23(3):859-65. doi:10.2339/politeknik.620198

Cited By

Predictive Prefetching in Client–Server Systems: A Navigational Behavior Modeling Approach

International Journal of Software Engineering and Knowledge Engineering

https://doi.org/10.1142/S0218194024500384