Characterization of Machine Learning Compilers for LLM inference on NVIDIA GPUs

Abstract

AI inference is pulled between Performance, developer Productivity, and device Portability (the P3 problem). Machine Learning Compilers (MLCs) aim to resolve this tension, but the ecosystem is fragmented, with each tool prioritizing one of these concerns. This paper evaluates the deployment trade-offs of PyTorch-based LLMs on NVIDIA GPUs using four prominent MLC tools: torch.compile, TensorRT, XLA, and ONNX Runtime. A dual methodology is used: synthetic PyTorch models to isolate individual optimizations, and end-to-end benchmarks on State-of-The-Art (SoTA) models (TinyLlama-1.1B, Llama-2-7B) to measure real-world performance. Findings reveal that the peak performance of Ahead-Of-Time (AOT) compilation requires architecture-specific tools such as TensorRT-LLM, which are necessary for SoTA LLMs but unusable for arbitrary PyTorch models. Just-In-Time (JIT) solutions such as torch.compile and its backends prove flexible and portable, compatible with all tested models, but unable to accelerate LLMs consistently. The choice of MLC therefore depends on P3 considerations and on model architecture.
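
The JIT workflow referenced above can be illustrated with a minimal sketch. The snippet below is not taken from the paper; the small synthetic model and settings are illustrative assumptions standing in for the benchmark configurations the authors describe, and it only shows how torch.compile wraps an eager PyTorch module.

```python
# Minimal sketch (assumptions, not the paper's benchmark code): compiling a
# small synthetic PyTorch model with the JIT-style torch.compile path.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """A small synthetic model standing in for the paper's synthetic benchmarks."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyMLP().to(device).eval()

# torch.compile wraps the eager model; the default Inductor backend JIT-compiles
# optimized kernels on the first call, and subsequent calls reuse them.
compiled = torch.compile(model)

with torch.no_grad():
    x = torch.randn(8, 256, device=device)
    eager_out = model(x)
    compiled_out = compiled(x)
    # The compiled module should stay numerically close to eager execution.
    print(torch.allclose(eager_out, compiled_out, atol=1e-5))
```

Because compilation happens lazily at the first forward pass, this path accepts any traceable PyTorch model, which is the portability advantage the abstract attributes to JIT solutions, in contrast to AOT tools that require a supported architecture up front.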
