Quantitative Performance Analysis of BLAS Libraries on GPU Architectures

Işıl Öz

doi:10.21205/deufmd.2024267606

Research Article

Year 2024, Volume: 26 Issue: 76, 40 - 48, 23.01.2024

Abstract

Basic Linear Algebra Subprograms (BLAS) are a set of linear algebra routines commonly used in machine learning and scientific computing applications. BLAS libraries provide optimized implementations of these routines and achieve high performance by exploiting the parallel execution units of the target computing system. With their very large number of cores, graphics processing units (GPUs) deliver high performance for computationally heavy workloads. Recent BLAS libraries use the parallel cores of GPU architectures efficiently by exploiting the data parallelism inherent in the routines. In this study, we analyze GPU-targeted functions from two BLAS libraries, cuBLAS and MAGMA, and evaluate their performance on a single-GPU NVIDIA architecture, taking architectural features and limitations into account. We collect architectural performance metrics and explore resource utilization characteristics. Our work aims to help researchers and programmers understand the performance behavior and GPU resource utilization of the BLAS routines implemented by these libraries.
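
As an illustration of the kind of routine such libraries optimize, the following minimal sketch shows what a Level-3 BLAS general matrix multiply (GEMM) computes, C = alpha * A * B + beta * C, together with the conventional 2*m*n*k operation count often used to turn a measured run time into a throughput figure. The abstract does not state which routines were measured, so the choice of GEMM is an assumption made here for illustration only. The function names gemm, gemm_flops and gflops_per_second are hypothetical helpers, not part of cuBLAS or MAGMA, and this pure-Python reference is not how either library implements the routine.

```python
def gemm(alpha, A, B, beta, C):
    """Reference GEMM on row-major lists of lists.

    Returns a new matrix alpha * (A x B) + beta * C, where A is m x k,
    B is k x n and C is m x n. Hypothetical helper for illustration only.
    """
    m, k = len(A), len(B)
    n = len(B[0])
    # Reject inconsistent shapes before doing any arithmetic.
    if any(len(row) != k for row in A):
        raise ValueError("A must have as many columns as B has rows")
    if any(len(row) != n for row in B):
        raise ValueError("B must be rectangular")
    if len(C) != m or any(len(row) != n for row in C):
        raise ValueError("C must be m x n")

    out = []
    for i in range(m):
        row = []
        for j in range(n):
            # Inner product of row i of A with column j of B.
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            row.append(alpha * acc + beta * C[i][j])
        out.append(row)
    return out


def gemm_flops(m, n, k):
    """Conventional GEMM operation count: one multiply and one add per term."""
    return 2 * m * n * k


def gflops_per_second(m, n, k, seconds):
    """Throughput in GFLOP/s for a GEMM of the given size and run time."""
    return gemm_flops(m, n, k) / seconds / 1e9


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    C = [[0.0, 0.0], [0.0, 0.0]]
    print(gemm(1.0, A, B, 0.0, C))
    print(gemm_flops(1024, 1024, 1024))
```

Each of the m*n output elements is an independent inner product, which is the data parallelism that GPU implementations map onto many threads.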

Keywords

Basic linear algebra subprograms, Graphics processing units, Performance analysis

Turkish title: BLAS Kütüphanelerinin GPU Mimarilerindeki Nicel Performans Analizi (Quantitative Performance Analysis of BLAS Libraries on GPU Architectures)

Abstract (translated from Turkish)

Basic Linear Algebra Subprograms (BLAS) comprise linear algebra routines widely used in machine learning and scientific computing. BLAS libraries, which provide optimized implementations of the BLAS routines, deliver high performance by exploiting the parallel execution units of computer systems. Graphics processing units, with their large number of cores, exhibit high performance for computationally heavy workloads. Modern BLAS libraries use GPU architectures efficiently by exploiting data parallelism. In this study, functions from two BLAS libraries (cuBLAS and MAGMA) are analyzed, and their performance on NVIDIA GPU architectures is evaluated with architectural features and limitations taken into account. Performance metrics are collected and resource utilization characteristics are identified. Our work aims to help researchers and programmers understand the performance behavior and GPU resource utilization of BLAS routines.

Keywords

Basic linear algebra subprograms, Graphics processing units, Performance analysis

There are 38 citations in total.

Details

Primary Language	English
Subjects	Engineering
Journal Section	Research Article
Authors	Işıl Öz 0000-0002-8310-1143
Early Pub Date	January 22, 2024
Publication Date	January 23, 2024
Published in Issue	Year 2024 Volume: 26 Issue: 76

Cite

APA	Öz, I. (2024). Quantitative Performance Analysis of BLAS Libraries on GPU Architectures. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen Ve Mühendislik Dergisi, 26(76), 40-48. https://doi.org/10.21205/deufmd.2024267606
AMA	Öz I. Quantitative Performance Analysis of BLAS Libraries on GPU Architectures. DEUFMD. January 2024;26(76):40-48. doi:10.21205/deufmd.2024267606
Chicago	Öz, Işıl. “Quantitative Performance Analysis of BLAS Libraries on GPU Architectures”. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen Ve Mühendislik Dergisi 26, no. 76 (January 2024): 40-48. https://doi.org/10.21205/deufmd.2024267606.
EndNote	Öz I (January 1, 2024) Quantitative Performance Analysis of BLAS Libraries on GPU Architectures. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen ve Mühendislik Dergisi 26 76 40–48.
IEEE	I. Öz, “Quantitative Performance Analysis of BLAS Libraries on GPU Architectures”, DEUFMD, vol. 26, no. 76, pp. 40–48, 2024, doi: 10.21205/deufmd.2024267606.
ISNAD	Öz, Işıl. “Quantitative Performance Analysis of BLAS Libraries on GPU Architectures”. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen ve Mühendislik Dergisi 26/76 (January 2024), 40-48. https://doi.org/10.21205/deufmd.2024267606.
JAMA	Öz I. Quantitative Performance Analysis of BLAS Libraries on GPU Architectures. DEUFMD. 2024;26:40–48.
MLA	Öz, Işıl. “Quantitative Performance Analysis of BLAS Libraries on GPU Architectures”. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen Ve Mühendislik Dergisi, vol. 26, no. 76, 2024, pp. 40-48, doi:10.21205/deufmd.2024267606.
Vancouver	Öz I. Quantitative Performance Analysis of BLAS Libraries on GPU Architectures. DEUFMD. 2024;26(76):40-8.
