Characterization of Machine Learning Compilers for LLM inference on NVIDIA GPUs

Abstract

AI inference is pulled between Performance, developer Productivity, and device Portability (the P3 problem). Machine Learning Compilers (MLCs) aim to resolve this tension, but the ecosystem is fragmented, with each tool prioritizing one of these concerns. This paper evaluates the deployment trade-offs of PyTorch-based LLMs on NVIDIA GPUs using four prominent MLC tools: torch.compile, TensorRT, XLA, and ONNX Runtime. A dual methodology is used: synthetic PyTorch models to isolate individual optimizations, and end-to-end benchmarks on State-of-The-Art (SoTA) models (TinyLlama-1.1B, Llama-2-7B) to measure real-world performance. Findings reveal that the peak performance of Ahead-Of-Time (AOT) compilation requires architecture-specific tools such as TensorRT-LLM, which are necessary for SoTA LLMs but unusable for arbitrary PyTorch models. Just-In-Time (JIT) solutions such as torch.compile and its backends prove flexible and portable, compatible with all tested models, but unable to accelerate LLMs consistently. The choice of MLC therefore depends on P3 considerations and on model architecture.
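
The JIT workflow referenced above can be illustrated with a minimal sketch. The snippet below is not taken from the paper; the small synthetic model and settings are illustrative assumptions standing in for the benchmark configurations the authors describe, and it only shows how torch.compile wraps an eager PyTorch module.

```python
# Minimal sketch (assumptions, not the paper's benchmark code): compiling a
# small synthetic PyTorch model with the JIT-style torch.compile path.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """A small synthetic model standing in for the paper's synthetic benchmarks."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyMLP().to(device).eval()

# torch.compile wraps the eager model; the default Inductor backend JIT-compiles
# optimized kernels on the first call, and subsequent calls reuse them.
compiled = torch.compile(model)

with torch.no_grad():
    x = torch.randn(8, 256, device=device)
    eager_out = model(x)
    compiled_out = compiled(x)
    # The compiled module should stay numerically close to eager execution.
    print(torch.allclose(eager_out, compiled_out, atol=1e-5))
```

Because compilation happens lazily at the first forward pass, this path accepts any traceable PyTorch model, which is the portability advantage the abstract attributes to JIT solutions, in contrast to AOT tools that require a supported architecture up front.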
