AN ANALYSIS OF APACHE SPARK AND GPU PERFORMANCES ON DATABASE SQL QUERIES FOR DISTRIBUTED NETWORKS
Year 2024, Volume: 11 Issue: 24, 428 - 437, 31.12.2024
Mehmet Turan, Emin Tenekeci, Kemal Güner
Abstract
GPUs are being adopted in a growing range of fields with successful results, and work on using them in database systems is becoming increasingly common. GPUs are also effective in distributed systems and computer networks, where they accelerate computational tasks by exploiting parallel processing capabilities across multiple nodes, and in tasks that demand high computational power, such as network traffic analysis and real-time data processing. The digital transformation taking place in every area of life has increased data diversity and created needs such as faster data analysis. These needs can be addressed either by increasing the system's hardware capacity or through software-based approaches. This study examines the performance differences between Apache Spark and the GPU on commonly used SQL queries over big data. To this end, SQL queries that are widely used in data analysis, such as grouping, sorting, and filtering, were evaluated. GPU-executed queries performed similarly to Apache Spark on simple queries, whereas on computation-intensive queries the GPU completed up to 3x faster.
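The three query classes named in the abstract (filtering, sorting, grouping) can be illustrated with a minimal timing sketch. This is not the study's setup: it uses Python's built-in `sqlite3` module as a stand-in engine rather than Apache Spark or a GPU, and the `sales` table, its columns, and the synthetic rows are hypothetical. The timings it prints say nothing about Spark or GPU performance; the sketch only shows the shape of such a comparison.

```python
import random
import sqlite3
import time


def build_table(n_rows=10_000, seed=0):
    """Create an in-memory table of synthetic rows (hypothetical schema)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
    rows = [
        (i, rng.choice(["north", "south", "east", "west"]), rng.uniform(0, 1000))
        for i in range(n_rows)
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# One representative query per class mentioned in the abstract.
QUERIES = {
    "filter": "SELECT id, amount FROM sales WHERE amount > 500",
    "sort": "SELECT id, amount FROM sales ORDER BY amount DESC",
    "group": "SELECT region, COUNT(*), AVG(amount) FROM sales GROUP BY region",
}


def time_query(conn, sql):
    """Run one query to completion and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


def run_benchmark(conn):
    """Return {query name: (row count, elapsed seconds)} for every query."""
    results = {}
    for name, sql in QUERIES.items():
        rows, elapsed = time_query(conn, sql)
        results[name] = (len(rows), elapsed)
    return results


if __name__ == "__main__":
    conn = build_table()
    for name, (n, elapsed) in run_benchmark(conn).items():
        print(f"{name:>6}: {n} rows in {elapsed * 1000:.2f} ms")
```

To compare engines in this style, the same query text would be run against each backend over identical data, with several repetitions per query so that one-off effects such as caching do not dominate the measurement.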