NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium

Abstract

AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirements on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium, based on kernel fusion and novel caching strategies, to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of the matmul kernel, it achieves an average 1.35× speedup (up to 2.22×), which translates to an average 1.66× speedup (up to 2.49×) for end-to-end LLM inference. Our code is released at https://github.com/dinghongsong/NeuronMM.
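
The abstract describes the approach only at a high level. As a rough, framework-agnostic illustration of what tiling, on-chip caching, and kernel fusion mean for a matmul, the sketch below keeps a row tile of A stationary (the analogue of holding it in fast on-chip memory) while sweeping column tiles of B, and fuses a scaling epilogue into the same loop so each output tile is written once. The tile size, shapes, and the `scale` epilogue are illustrative assumptions, not the paper's actual Trainium kernel.

```python
import numpy as np

def fused_tiled_matmul(A, B, scale, tile=128):
    """Conceptual tiled matmul C = (A @ B) * scale with a fused epilogue.

    A row tile of A stays "resident" across the sweep over column tiles of B
    (mimicking reuse out of software-managed SRAM), and the scaling step is
    fused into the same loop instead of being a separate pass over C.
    This is a sketch of the general technique, not NeuronMM's kernel.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        a_block = A[i:i + tile, :]                 # reused across the whole j-loop
        for j in range(0, N, tile):
            acc = np.zeros((a_block.shape[0], min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):            # accumulate over the K dimension
                acc += a_block[:, k:k + tile] @ B[k:k + tile, j:j + tile]
            C[i:i + tile, j:j + tile] = acc * scale  # fused epilogue, single write
    return C
```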
