Research Article

Equi-Depth Histogram Construction Methodology for Big Data Tools

Year 2020, Volume 23, Issue 3, 859 - 865, 01.09.2020
https://doi.org/10.2339/politeknik.620198

Abstract

In recent decades, countless data sources such as social media, machines, and networks have been constantly pushing data into the digital world, and the size of this data has been growing exponentially. Equi-depth histograms are essential for capturing the statistical properties of data used in query optimization. In this paper, we present approximate equi-depth histogram construction for big data using both Apache Pig scripts and a Java web interface interacting with Apache Hadoop. We adopt equi-depth histogram construction approaches with quality guarantees for big data and implement them with Apache Hadoop Map-Reduce and Apache Pig user-defined functions. We introduce a prototype implementation that constructs approximate equi-depth histograms from a JavaServer Faces page using Apache Hadoop jobs and the Hadoop Distributed File System, and we evaluate these methods through a demonstration. We also explain Apache Pig scripting techniques for creating equi-depth histograms over big data. The results indicate that our system supports writing multiple jobs with Apache Pig, allowing programmers to exploit the advantages of Apache Pig to create histograms while avoiding the complex implementation of Map-Reduce jobs.
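The core idea behind the histograms described above can be illustrated without any cluster machinery. In an equi-depth (equal-frequency) histogram, bucket boundaries are chosen so that each bucket holds roughly the same number of values; the paper builds these at scale with Hadoop and Pig, but the following is a minimal single-machine sketch (the function name and remainder-handling policy are illustrative assumptions, not the paper's implementation):

```python
# Minimal sketch of equi-depth (equal-frequency) histogram construction.
# Each bucket is a (low, high, count) triple; counts differ by at most one.

def equi_depth_histogram(values, num_buckets):
    """Split values into num_buckets buckets with near-equal counts."""
    data = sorted(values)
    n = len(data)
    buckets = []
    start = 0
    for b in range(num_buckets):
        # Base bucket size is n // num_buckets; the remainder is spread
        # across the first n % num_buckets buckets.
        end = start + n // num_buckets + (1 if b < n % num_buckets else 0)
        buckets.append((data[start], data[end - 1], end - start))
        start = end
    return buckets

# 12 values into 4 buckets -> exactly 3 values per bucket.
hist = equi_depth_histogram([5, 1, 9, 3, 7, 2, 8, 4, 6, 10, 11, 12], 4)
```

At big-data scale, the sort step becomes the bottleneck, which is why the distributed approaches in the paper compute approximate bucket boundaries instead of sorting the full dataset.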

References

  • B. Yıldız, T. Büyüktanır, and F. Emekci, “Equi-depth histogram construction for big data with quality guarantees,” arXiv preprint arXiv:1606.05633, 2016.
  • D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum, “Stateful bulk processing for incremental analytics,” in Proceedings of the 1st ACM symposium on Cloud computing. ACM, 2010, pp. 51–62.
  • A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, “Data warehousing and analytics infrastructure at facebook,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010, pp. 1013–1020.
  • A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, “Hive - a petabyte scale data warehouse using hadoop,” in Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE, 2010, pp. 996–1005.
  • A. S. Foundation. (2008) Apache hadoop. [Online]. Available: https://hadoop.apache.org/
  • J. Dean and S. Ghemawat, “Mapreduce: a flexible data processing tool,” Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
  • J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, “Hadoop++: making a yellow elephant run like a cheetah (without it even noticing),” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 515–529, 2010.
  • A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, “Building a high-level dataflow system on top of map-reduce: the pig experience,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1414–1425, 2009.
  • A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich, “Trojan data layouts: right shoes for a running elephant,” in Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011, p. 21.
  • M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving mapreduce performance in heterogeneous environments.” in OSDI, vol. 8, no. 4, 2008, p. 7.
  • M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS Operating Systems Review, vol. 41, no. 3. ACM, 2007, pp. 59–72.
  • A. Schumacher, L. Pireddu, M. Niemenmaa, A. Kallio, E. Korpelainen, G. Zanetti, and K. Heljanko, “Seqpig: simple and scalable scripting for large sequencing data sets in hadoop,” Bioinformatics, vol. 30, no. 1, pp. 119–120, 2014.
  • S. Wu, F. Li, S. Mehrotra, and B. C. Ooi, “Query optimization for massively parallel data processing,” in Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011, p. 12.
  • S. Babu, “Towards automatic optimization of mapreduce programs,” in Proceedings of the 1st ACM symposium on Cloud computing. ACM, 2010, pp. 137–142.
  • H. Herodotou and S. Babu, “Profiling, what-if analysis, and cost-based optimization of mapreduce programs,” Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 1111–1122, 2011.
  • E. Jahani, M. J. Cafarella, and C. Ré, “Automatic optimization for mapreduce programs,” Proceedings of the VLDB Endowment, vol. 4, no. 6, pp. 385–396, 2011.
  • D. Jiang, B. C. Ooi, L. Shi, and S. Wu, “The performance of mapreduce: An in-depth study,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 472–483, 2010.
  • J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad, “Only aggressive elephants are fast elephants,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1591–1602, 2012.
  • A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata, “Column-oriented storage techniques for mapreduce,” Proceedings of the VLDB Endowment, vol. 4, no. 7, pp. 419–429, 2011.
  • Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu, “Llama: leveraging columnar storage for scalable join processing in the mapreduce framework,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011, pp. 961–972.
  • Google search statistics. [Online]. Available: http://www.internetlivestats.com/google-search-statistics/
  • Yahoo advertising. [Online]. Available: https://advertising.yahoo.com/yahoo-sites/Homepage/index.htm
  • Y. Ioannidis, “The history of histograms (abridged),” in Proceedings of the 29th international conference on Very large data bases-Volume 29. VLDB Endowment, 2003, pp. 19–30.
  • C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig latin: a not-so-foreign language for data processing,” in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 1099–1110.
  • P. M. Hallam-Baker and B. Behlendorf, “Extended log file format,” WWW Journal, vol. 3, p. W3C, 1996.

There are 25 citations in total.

Details

Primary Language English
Subjects Engineering
Journal Section Research Article
Authors

Tolga Büyüktanır 0000-0001-5317-0028

Ahmet Ercan Topcu 0000-0003-1929-5358

Publication Date September 1, 2020
Submission Date September 13, 2019
Published in Issue Year 2020

Cite

APA Büyüktanır, T., & Topcu, A. E. (2020). Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi, 23(3), 859-865. https://doi.org/10.2339/politeknik.620198
AMA Büyüktanır T, Topcu AE. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. September 2020;23(3):859-865. doi:10.2339/politeknik.620198
Chicago Büyüktanır, Tolga, and Ahmet Ercan Topcu. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi 23, no. 3 (September 2020): 859-65. https://doi.org/10.2339/politeknik.620198.
EndNote Büyüktanır T, Topcu AE (September 1, 2020) Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi 23 3 859–865.
IEEE T. Büyüktanır and A. E. Topcu, “Equi-Depth Histogram Construction Methodology for Big Data Tools”, Politeknik Dergisi, vol. 23, no. 3, pp. 859–865, 2020, doi: 10.2339/politeknik.620198.
ISNAD Büyüktanır, Tolga - Topcu, Ahmet Ercan. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi 23/3 (September 2020), 859-865. https://doi.org/10.2339/politeknik.620198.
JAMA Büyüktanır T, Topcu AE. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. 2020;23:859–865.
MLA Büyüktanır, Tolga and Ahmet Ercan Topcu. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi, vol. 23, no. 3, 2020, pp. 859-65, doi:10.2339/politeknik.620198.
Vancouver Büyüktanır T, Topcu AE. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. 2020;23(3):859-65.
 
ABSTRACTING / INDEXING

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.