Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement

Abstract

BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic capability and are highly suited to BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex, or legacy codes, even when the application is BLAS-heavy. While various tools exist to automatically offload BLAS to GPUs, they are often impractical because of the high cost of mandatory data transfers. The advent of unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allows cache-coherent memory access across all types of memory for both CPU and GPU, potentially eliminating the bottlenecks faced in conventional architectures. This opens the way for new application development and porting strategies. In this paper, building on my preliminary work[1] demonstrating that performant automatic *gemm offload is possible, I extend the framework to all level-3 BLAS operations and present SCILIB-Accel[2], a novel tool for automatic BLAS offload. SCILIB-Accel leverages the cache-coherent NVLink C2C interconnect in Grace-Hopper and introduces a Device First-Use data movement policy. This policy, inspired by the OpenMP First-Touch approach in multi-socket CPU programming, minimizes CPU-GPU data transfers for typical scientific computing codes. Additionally, using dynamic binary instrumentation, the tool intercepts BLAS symbols directly from a CPU binary, requiring no code modifications or recompilation. SCILIB-Accel has been evaluated with multiple quantum physics codes on up to a few hundred GPU nodes, yielding promising speedups. Notably, for the LSMS method in the MuST suite, a 3x speedup was achieved on Grace-Hopper compared to Grace-Grace. SCILIB-Accel is the first tool to deliver practical, high-performance automatic BLAS offload for scientific applications.
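
The intuition behind a first-use style policy can be illustrated with a toy cost model. The sketch below is not SCILIB-Accel code and does not call any GPU or BLAS library. It assumes, for illustration only, that a matrix is migrated to device memory the first time an offloaded BLAS call uses it and then stays there, and it compares the bytes moved against a conventional scheme that copies every operand on every call. The names `Buffer`, `FirstUsePolicy`, `CopyEveryCallPolicy`, and `run` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """A matrix operand; only its size and residency are modelled."""
    name: str
    nbytes: int
    on_device: bool = False


@dataclass
class FirstUsePolicy:
    """Assumed behaviour: migrate an operand once, on its first device use."""
    bytes_moved: int = 0

    def before_call(self, operands):
        for buf in operands:
            if not buf.on_device:
                self.bytes_moved += buf.nbytes
                buf.on_device = True  # stays resident for later calls


@dataclass
class CopyEveryCallPolicy:
    """Baseline: copy every operand host-to-device on every call."""
    bytes_moved: int = 0

    def before_call(self, operands):
        for buf in operands:
            self.bytes_moved += buf.nbytes


def run(policy, n_calls, n=1024):
    """Simulate n_calls gemm-like calls on the same three n x n double matrices.

    Only host-to-device traffic is counted; copy-back is ignored.
    """
    a, b, c = (Buffer(name, n * n * 8) for name in "ABC")
    for _ in range(n_calls):
        policy.before_call([a, b, c])
    return policy.bytes_moved


if __name__ == "__main__":
    print("first-use bytes moved:      ", run(FirstUsePolicy(), 10))
    print("copy-every-call bytes moved:", run(CopyEveryCallPolicy(), 10))
```

In this model the first-use traffic is independent of the number of calls, while the baseline grows linearly with it, which is the pattern that favours codes reusing the same matrices across many BLAS calls.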
