AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS

Mehmet Turan; Emin Tenekeci; Kemal Güner

doi:10.54365/adyumbd.1508182

Research Article

Year 2024, Volume: 11 Issue: 24, 428 - 437, 31.12.2024

Abstract

The successful use of GPUs in many fields has prompted efforts to apply them to database systems. GPUs are also effective in distributed systems and computer networks, where they accelerate computational tasks by exploiting parallel processing across multiple nodes, particularly for tasks that demand high computational power such as network traffic analysis and real-time data processing. Digital transformation in all areas of life has created new needs, including handling greater data diversity and analyzing data faster. Upgrading a system's hardware capacity and adopting software-based approaches are two possible ways to meet these needs. This study examines the performance differences between Apache Spark and a GPU on commonly used SQL queries over big data, specifically the grouping, sorting, and filtering queries that are common in data analysis. On simple queries the GPU performed similarly to Apache Spark, whereas queries requiring calculation completed 3x faster on the GPU.

Keywords

GPU, Apache Spark, Distributed Networking, Big Data, HPC

DAĞITIK AĞLAR İÇİN VERİ TABANI SQL SORGULARINDA APACHE SPARK VE GPU PERFORMANSLARININ ANALİZİ

Abstract

Studies on using the GPU in databases are becoming increasingly widespread, as it is adopted in a new field every day and delivers successful results. The GPU is also effective in distributed systems and computer networks, where it accelerates computational tasks by exploiting parallel processing across multiple nodes, and in tasks that require high computational power, such as network traffic analysis and real-time data processing. The digital transformation taking place in every area of life has created needs such as handling increased data diversity and analyzing data faster. To analyze this data, these needs can be met either by increasing the system's hardware capacity or through software-based approaches. This study examines the performance differences between Apache Spark and the GPU on commonly used SQL queries over big data. To that end, SQL queries commonly used in data analysis, such as grouping, sorting, and filtering, were used. Queries run on the GPU performed similarly to those run on Apache Spark for simple queries, while for queries requiring calculation the GPU finished in up to 3x less time.

Keywords

GPU, Apache Spark, Distributed Networks, HPC, Big Data

There are 28 citations in total.

Details

Primary Language	English
Subjects	Computer Vision and Multimedia Computation (Other), Concurrent/Parallel Systems and Technologies, Performance Evaluation, High Performance Computing, Machine Learning (Other)
Journal Section	Research Article
Authors	Mehmet Turan (ORCID 0000-0002-8038-0749); Emin Tenekeci (ORCID 0000-0001-5944-4704); Kemal Güner (ORCID 0000-0003-3495-9044)
Early Pub Date	December 29, 2024
Publication Date	December 31, 2024
Submission Date	July 1, 2024
Acceptance Date	October 3, 2024
Published in Issue	Year 2024 Volume: 11 Issue: 24

Cite

APA	Turan, M., Tenekeci, E., & Güner, K. (2024). AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi, 11(24), 428-437. https://doi.org/10.54365/adyumbd.1508182
AMA	Turan M, Tenekeci E, Güner K. AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi. December 2024;11(24):428-437. doi:10.54365/adyumbd.1508182
Chicago	Turan, Mehmet, Emin Tenekeci, and Kemal Güner. “AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS”. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi 11, no. 24 (December 2024): 428-37. https://doi.org/10.54365/adyumbd.1508182.
EndNote	Turan M, Tenekeci E, Güner K (December 1, 2024) AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi 11 24 428–437.
IEEE	M. Turan, E. Tenekeci, and K. Güner, “AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS”, Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi, vol. 11, no. 24, pp. 428–437, 2024, doi: 10.54365/adyumbd.1508182.
ISNAD	Turan, Mehmet et al. “AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS”. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi 11/24 (December 2024), 428-437. https://doi.org/10.54365/adyumbd.1508182.
JAMA	Turan M, Tenekeci E, Güner K. AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi. 2024;11:428–437.
MLA	Turan, Mehmet et al. “AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS”. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi, vol. 11, no. 24, 2024, pp. 428-37, doi:10.54365/adyumbd.1508182.
Vancouver	Turan M, Tenekeci E, Güner K. AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS. Adıyaman Üniversitesi Mühendislik Bilimleri Dergisi. 2024;11(24):428-37.
