SCALABLE IMPLEMENTATIONS OF DESCRIPTIVE STATISTICS ON HADOOP

Özgür Yılmazel

Research Article

Year 2019, Volume: 1, Issue: 1, 43 - 58, 30.06.2019

Abstract

Big Data is one of the most prominent technologies of our time. The volume of
data grows day by day, driven by technologies that generate data continuously,
such as social media, sensors, and the Internet of Things. This massive growth
in the amount of data worldwide calls for different approaches to storing,
processing, and analyzing big data. A quantitative dataset has many features,
and descriptive statistics can describe these features in a meaningful and
manageable form without listing every value in the dataset. However, standard
statistical techniques may not be suitable for big data because of its size,
complexity, and velocity. Although many off-the-shelf statistical tools are
available for analyzing quantitative data, they are not always compatible with
big data file systems. In this paper, we present our implementations of
descriptive statistics algorithms over large datasets and demonstrate their
scalability through experiments on a small Hadoop cluster with 196 threads. The
study shows that descriptive statistics for large datasets can benefit from the
distributed computation features of a Hadoop cluster. The work was completed
with the support of TÜBİTAK TEYDEB.
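
This page does not give the paper's algorithms, so the following is only a minimal sketch of the general idea, not the author's implementation. It assumes each partition of the data can be summarized independently (the map step) and that the partial summaries can then be merged pairwise (the reduce step) using the standard pairwise update for mean and variance. The names `Summary`, `summarize`, and `merge` are hypothetical, and plain Python lists stand in for the input splits of a Hadoop job.

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass(frozen=True)
class Summary:
    """Mergeable partial summary of one data partition (hypothetical name)."""
    n: int            # number of values seen
    mean: float       # mean of the values seen
    m2: float         # sum of squared deviations from the mean
    minimum: float
    maximum: float


def summarize(values):
    """Map step: reduce one non-empty partition to a Summary."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return Summary(n, mean, m2, min(values), max(values))


def merge(a, b):
    """Reduce step: combine two summaries without revisiting the raw data."""
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2,
                   min(a.minimum, b.minimum),
                   max(a.maximum, b.maximum))


# Three small partitions stand in for input splits stored on different nodes.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]

total = reduce(merge, map(summarize, partitions))
variance = total.m2 / total.n      # population variance
std_dev = math.sqrt(variance)
```

For the eight sample values, the merged summary gives a mean of 5, a population variance of 4, and a standard deviation of 2, up to floating-point rounding. Because `merge` is associative, the partial summaries can be combined in any grouping, which is the property that lets this kind of computation be spread over many workers.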

Keywords

Big Data, Descriptive Statistics, Hadoop, MapReduce

Details

Primary Language	English
Section	Articles
Authors	Özgür Yılmazel 0000-0002-8932-9587
Publication Date	30 June 2019
Published in Issue	Year 2019, Volume: 1, Issue: 1

Cite

APA	Yılmazel, Ö. (2019). SCALABLE IMPLEMENTATIONS OF DESCRIPTIVE STATISTICS ON HADOOP. Nicel Bilimler Dergisi, 1(1), 43-58.