NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium
Abstract
AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), offers an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and its special requirements on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium, based on kernel fusion and novel caching strategies, to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of the matmul kernel, it achieves an average 1.35× speedup (up to 2.22×), which translates to an average 1.66× speedup (up to 2.49×) for end-to-end LLM inference. Our code is released at https://github.com/dinghongsong/NeuronMM.
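To make the kernel-fusion idea in the abstract concrete, the sketch below shows a generic tiled matmul whose epilogue (bias add plus activation) is fused into the same tile loop, so each output tile is finished while it is still resident in fast on-chip memory rather than being written out and re-read by a separate kernel. This is a conceptual NumPy illustration under assumed tile sizes and a hypothetical `fused_tiled_matmul` helper, not the NeuronMM implementation or the Trainium kernel API.

```python
# Conceptual sketch (assumption: not the NeuronMM code): tiled matmul with a
# fused epilogue, illustrating how fusion avoids an extra round trip through
# the memory hierarchy for the intermediate matmul result.

import numpy as np

def fused_tiled_matmul(A, B, bias, tile=128):
    """Compute gelu(A @ B + bias) one output tile at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and bias.shape == (N,)
    C = np.empty((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            # Accumulate the (i, j) output tile over K-dimension slices,
            # mimicking partial sums kept in an on-chip accumulator.
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=np.float32)
            for k in range(0, K, tile):
                a_blk = A[i:i + tile, k:k + tile]   # would be a DMA into SRAM
                b_blk = B[k:k + tile, j:j + tile]
                acc += a_blk.astype(np.float32) @ b_blk.astype(np.float32)
            # Fused epilogue: bias + tanh-approximated GeLU applied before the
            # tile leaves the accumulator, instead of in a second kernel.
            acc += bias[j:j + tile]
            acc = 0.5 * acc * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (acc + 0.044715 * acc**3)))
            C[i:i + tile, j:j + tile] = acc.astype(A.dtype)
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((256, 512)).astype(np.float32)
    B = rng.standard_normal((512, 384)).astype(np.float32)
    bias = rng.standard_normal(384).astype(np.float32)
    print(fused_tiled_matmul(A, B, bias).shape)  # (256, 384)
```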