Research Article

Equi-Depth Histogram Construction Methodology for Big Data Tools

Year 2020, Volume: 23 Issue: 3, 859 - 865, 01.09.2020
https://doi.org/10.2339/politeknik.620198

Abstract

In recent decades, countless data sources such as social media, machines, and networks have been constantly pushing data into the digital world, and the size of that data has been growing exponentially. Equi-depth histograms are essential for understanding the statistical information used in query optimization. In this paper, we present approximate equi-depth histogram construction for big data using both Apache Pig scripts and a Java web interface that interacts with Apache Hadoop. We adopt equi-depth histogram construction approaches with quality guarantees for big data and implement them with Apache Hadoop MapReduce and Apache Pig user-defined functions. We introduce a prototype that constructs the approximate equi-depth histogram from a JavaServer Faces page using Apache Hadoop jobs and the Hadoop Distributed File System, and we evaluate these methods with a demonstration. We also explain Apache Pig scripting techniques for creating equi-depth histograms over big data. The results indicate that our system makes it possible to write multiple jobs using Apache Pig, and that programmers can use the advantages of Apache Pig to create histograms without the complex implementation of MapReduce jobs.
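To make the term concrete, the following is a minimal single-machine sketch of an exact equi-depth histogram, in which bucket boundaries are chosen so that each bucket holds roughly the same number of values. It is not the approximate MapReduce or Apache Pig construction described in the paper; the function names `equi_depth_boundaries` and `bucket_counts` are hypothetical and used only for illustration.

```python
def equi_depth_boundaries(values, num_buckets):
    """Return the upper boundary of each bucket of an exact equi-depth histogram.

    Illustrative only: sorts everything in memory, which the paper's
    approximate, distributed approach is designed to avoid.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    boundaries = []
    for i in range(1, num_buckets + 1):
        # Position of the last element that belongs to bucket i.
        last = (i * n) // num_buckets - 1
        boundaries.append(ordered[max(last, 0)])
    return boundaries


def bucket_counts(values, boundaries):
    """Count how many values fall at or below each boundary (first match wins)."""
    counts = [0] * len(boundaries)
    for v in values:
        for i, upper in enumerate(boundaries):
            if v <= upper:
                counts[i] += 1
                break
    return counts


if __name__ == "__main__":
    data = [7, 1, 5, 3, 9, 2, 8, 4, 6, 10, 12, 11]
    bounds = equi_depth_boundaries(data, 4)
    print(bounds)
    print(bucket_counts(data, bounds))
```

With distinct values whose count is a multiple of the bucket count, every bucket ends up with the same depth; with repeated values or an uneven split, the depths are only approximately equal.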

References

  • B. Yıldız, T. Büyüktanır, and F. Emekci, “Equi-depth histogram construction for big data with quality guarantees,” arXiv preprint arXiv:1606.05633, 2016.
  • D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum, “Stateful bulk processing for incremental analytics,” in Proceedings of the 1st ACM symposium on Cloud computing. ACM, 2010, pp. 51–62.
  • A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, “Data warehousing and analytics infrastructure at facebook,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010, pp. 1013–1020.
  • A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, “Hive-a petabyte scale data warehouse using hadoop,” in Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE, 2010, pp. 996–1005.
  • A. S. Foundation. (2008) Apache hadoop. [Online]. Available: https://hadoop.apache.org/
  • J. Dean and S. Ghemawat, “Mapreduce: a flexible data processing tool,” Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
  • J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, “Hadoop++: making a yellow elephant run like a cheetah (without it even noticing),” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 515–529, 2010.
  • A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, “Building a high-level dataflow system on top of map-reduce: the pig experience,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1414–1425, 2009.
  • A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich, “Trojan data layouts: right shoes for a running elephant,” in Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011, p. 21.
  • M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving mapreduce performance in heterogeneous environments,” in OSDI, vol. 8, no. 4, 2008, p. 7.
  • M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS Operating Systems Review, vol. 41, no. 3. ACM, 2007, pp. 59–72.
  • A. Schumacher, L. Pireddu, M. Niemenmaa, A. Kallio, E. Korpelainen, G. Zanetti, and K. Heljanko, “Seqpig: simple and scalable scripting for large sequencing data sets in hadoop,” Bioinformatics, vol. 30, no. 1, pp. 119–120, 2014.
  • S. Wu, F. Li, S. Mehrotra, and B. C. Ooi, “Query optimization for massively parallel data processing,” in Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011, p. 12.
  • S. Babu, “Towards automatic optimization of mapreduce programs,” in Proceedings of the 1st ACM symposium on Cloud computing. ACM, 2010, pp. 137–142.
  • H. Herodotou and S. Babu, “Profiling, what-if analysis, and cost-based optimization of mapreduce programs,” Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 1111–1122, 2011.
  • E. Jahani, M. J. Cafarella, and C. Ré, “Automatic optimization for mapreduce programs,” Proceedings of the VLDB Endowment, vol. 4, no. 6, pp. 385–396, 2011.
  • D. Jiang, B. C. Ooi, L. Shi, and S. Wu, “The performance of mapreduce: An in-depth study,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 472–483, 2010.
  • J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad, “Only aggressive elephants are fast elephants,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1591–1602, 2012.
  • A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata, “Column-oriented storage techniques for mapreduce,” Proceedings of the VLDB Endowment, vol. 4, no. 7, pp. 419–429, 2011.
  • Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu, “Llama: leveraging columnar storage for scalable join processing in the mapreduce framework,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011, pp. 961–972.
  • Google search statistics. [Online]. Available: http://www.internetlivestats.com/google-search-statistics/
  • Yahoo advertising. [Online]. Available: https://advertising.yahoo.com/yahoo-sites/Homepage/index.htm
  • Y. Ioannidis, “The history of histograms (abridged),” in Proceedings of the 29th international conference on Very large data bases-Volume 29. VLDB Endowment, 2003, pp. 19–30.
  • C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig latin: a not-so-foreign language for data processing,” in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 1099–1110.
  • P. M. Hallam-Baker and B. Behlendorf, “Extended log file format,” WWW Journal, vol. 3, p. W3C, 1996.

There are 25 citations in total.

Details

Primary Language: English
Subjects: Engineering
Journal Section: Research Article
Authors

Tolga Büyüktanır 0000-0001-5317-0028

Ahmet Ercan Topcu 0000-0003-1929-5358

Publication Date: September 1, 2020
Submission Date: September 13, 2019
Published in Issue: Year 2020, Volume: 23 Issue: 3

Cite

APA Büyüktanır, T., & Topcu, A. E. (2020). Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi, 23(3), 859-865. https://doi.org/10.2339/politeknik.620198
AMA Büyüktanır T, Topcu AE. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. September 2020;23(3):859-865. doi:10.2339/politeknik.620198
Chicago Büyüktanır, Tolga, and Ahmet Ercan Topcu. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi 23, no. 3 (September 2020): 859-65. https://doi.org/10.2339/politeknik.620198.
EndNote Büyüktanır T, Topcu AE (September 1, 2020) Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi 23 3 859–865.
IEEE T. Büyüktanır and A. E. Topcu, “Equi-Depth Histogram Construction Methodology for Big Data Tools”, Politeknik Dergisi, vol. 23, no. 3, pp. 859–865, 2020, doi: 10.2339/politeknik.620198.
ISNAD Büyüktanır, Tolga - Topcu, Ahmet Ercan. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi 23/3 (September 2020), 859-865. https://doi.org/10.2339/politeknik.620198.
JAMA Büyüktanır T, Topcu AE. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. 2020;23:859–865.
MLA Büyüktanır, Tolga and Ahmet Ercan Topcu. “Equi-Depth Histogram Construction Methodology for Big Data Tools”. Politeknik Dergisi, vol. 23, no. 3, 2020, pp. 859-65, doi:10.2339/politeknik.620198.
Vancouver Büyüktanır T, Topcu AE. Equi-Depth Histogram Construction Methodology for Big Data Tools. Politeknik Dergisi. 2020;23(3):859-65.