Automatic GPU Memory Access Optimization for AoSoA-based Application in OP2 Framework
Abstract
Portable parallel programming methods are attractive for application developers as hardware architectures become increasingly diverse. OP2 is a domain-specific programming framework for unstructured mesh applications that supports unified programming for multiple hardware platforms. Structures, user-defined data types that group items of possibly different types into a single type, are commonly used in many applications. Current OP2 implementations face limitations in leveraging GPU memory hierarchies when handling complex data organizations, specifically the Array of Structure of Arrays (AoSoA) pattern, in which each element of the top-level array is a structure of multi-dimensional arrays. To address this issue, we first propose a new SoA (Structure of Arrays) layout transformation algorithm for AoSoA-based applications to optimize data access locality. Then, we introduce new OP2 primitives that enable the generated CUDA code to utilize local memory and shared memory. These enhancements, integrated into OP2's library and source-to-source translator, enable automatic generation of optimized CUDA code. We evaluate the proposed approaches with a high-order unstructured CFD application on representative GPUs. Compared to the original implementation, the optimized implementation improves performance by up to 25.68x on the NVIDIA V100S, 4.74x on the NVIDIA A100, and 3.7x on the Hygon Z100 DCU. We also measure a selected set of low-level GPU performance metrics to better explain the results.
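As a minimal illustrative sketch (not taken from the paper or the OP2 API), the following C++ snippet contrasts an AoSoA layout, where each top-level array element is a structure of small arrays, with the transposed SoA layout that stores each field contiguously so that consecutive GPU threads access consecutive memory. The type names (CellAoSoA, MeshSoA), the fields, and the dimensionality are assumptions chosen for illustration only.

```cpp
#include <cstdio>
#include <vector>

constexpr int NDIM = 3;  // assumed dimensionality of each per-element array

// AoSoA: array of structures whose members are themselves arrays.
struct CellAoSoA {
    double flux[NDIM];
    double grad[NDIM];
};

// SoA: one flat array per field, indexed as field[d * n_cells + i] so that
// consecutive elements i of a given component d are contiguous in memory.
struct MeshSoA {
    std::vector<double> flux;  // size NDIM * n_cells
    std::vector<double> grad;  // size NDIM * n_cells
};

// Hypothetical layout transformation from AoSoA to SoA (illustration only).
MeshSoA to_soa(const std::vector<CellAoSoA>& cells) {
    const size_t n = cells.size();
    MeshSoA soa{std::vector<double>(NDIM * n), std::vector<double>(NDIM * n)};
    for (size_t i = 0; i < n; ++i) {
        for (int d = 0; d < NDIM; ++d) {
            soa.flux[d * n + i] = cells[i].flux[d];
            soa.grad[d * n + i] = cells[i].grad[d];
        }
    }
    return soa;
}

int main() {
    std::vector<CellAoSoA> cells(4);
    for (size_t i = 0; i < cells.size(); ++i)
        for (int d = 0; d < NDIM; ++d)
            cells[i].flux[d] = 10.0 * static_cast<double>(i) + d;

    MeshSoA soa = to_soa(cells);
    // In the SoA layout, component d = 0 of all cells is contiguous,
    // which is the access pattern that coalesces well on GPUs.
    printf("flux[d=0] of cells 0..3: %g %g %g %g\n",
           soa.flux[0], soa.flux[1], soa.flux[2], soa.flux[3]);
    return 0;
}
```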