Automatic GPU Memory Access Optimization for AoSoA-based Application in OP2 Framework
Abstract
Portable parallel programming methods are attractive for application developers as hardware architectures become increasingly diverse. OP2 is a domain-specific programming framework for unstructured mesh applications that supports unified programming for multiple hardware platforms. Structures, user-defined data types that group items of possibly different types into a single type, are commonly used in many applications. Current OP2 implementations face limitations in leveraging GPU memory hierarchies when handling complex data organizations, specifically the Array of Structure of Arrays (AoSoA) pattern, in which each element of the top-level array is a structure of multi-dimensional arrays. To address this issue, we first propose a new SoA (Structure of Arrays) layout transformation algorithm for AoSoA-based applications to optimize data access locality. Then, we introduce new OP2 primitives that enable the generated CUDA code to utilize local memory and shared memory. These enhancements, integrated into OP2's library and source-to-source translator, enable automatic generation of optimized CUDA code. We evaluate the proposed approaches with a high-order unstructured CFD application on representative GPUs. Compared to the original implementation, the optimized implementation improves performance by up to 25.68x on the NVIDIA V100S, 4.74x on the NVIDIA A100, and 3.7x on the Hygon Z100 DCU. We also measure a selected set of low-level GPU performance metrics to better explain the results.
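As a minimal illustrative sketch (not taken from the paper or the OP2 API), the following C++ snippet contrasts an AoSoA layout, where each top-level array element is a structure of small arrays, with the transposed SoA layout that stores each field contiguously so that consecutive GPU threads access consecutive memory. The type names (CellAoSoA, MeshSoA), the fields, and the dimensionality are assumptions chosen for illustration only.

```cpp
#include <cstdio>
#include <vector>

constexpr int NDIM = 3;  // assumed dimensionality of each per-element array

// AoSoA: array of structures whose members are themselves arrays.
struct CellAoSoA {
    double flux[NDIM];
    double grad[NDIM];
};

// SoA: one flat array per field, indexed as field[d * n_cells + i] so that
// consecutive elements i of a given component d are contiguous in memory.
struct MeshSoA {
    std::vector<double> flux;  // size NDIM * n_cells
    std::vector<double> grad;  // size NDIM * n_cells
};

// Hypothetical layout transformation from AoSoA to SoA (illustration only).
MeshSoA to_soa(const std::vector<CellAoSoA>& cells) {
    const size_t n = cells.size();
    MeshSoA soa{std::vector<double>(NDIM * n), std::vector<double>(NDIM * n)};
    for (size_t i = 0; i < n; ++i) {
        for (int d = 0; d < NDIM; ++d) {
            soa.flux[d * n + i] = cells[i].flux[d];
            soa.grad[d * n + i] = cells[i].grad[d];
        }
    }
    return soa;
}

int main() {
    std::vector<CellAoSoA> cells(4);
    for (size_t i = 0; i < cells.size(); ++i)
        for (int d = 0; d < NDIM; ++d)
            cells[i].flux[d] = 10.0 * static_cast<double>(i) + d;

    MeshSoA soa = to_soa(cells);
    // In the SoA layout, component d = 0 of all cells is contiguous,
    // which is the access pattern that coalesces well on GPUs.
    printf("flux[d=0] of cells 0..3: %g %g %g %g\n",
           soa.flux[0], soa.flux[1], soa.flux[2], soa.flux[3]);
    return 0;
}
```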