On-Device Large Language Models: A Survey of Model Compression and System Optimization

Abstract

Large language models are increasingly deployed on device and at the edge, where memory capacity, bandwidth, latency, and privacy requirements dominate system behavior. This survey systematizes the on-device stack from algorithms to systems. On the model side, we present a clear taxonomy of quantization, pruning, knowledge distillation, low-rank adaptation, and hybrid pipelines, explaining where representative methods belong and how they compose. On the system side, we link these techniques to inference frameworks, compiler and runtime optimizations, kernel fusion, and explicit management of the KV cache. We further propose a unified ALEM protocol, namely Accuracy, Latency, Energy, and Memory, and instantiate it on representative models with 1 to 4 billion parameters to reveal practical trade-offs: apply quantization first for memory and time-to-first-token, pair structured pruning with mergeable low-rank compensation, and treat the KV cache as a first-class subsystem through paging, compression, and eviction. Finally, we outline open problems and directions, including a unified low-bit pipeline that couples transform, calibration, and kernel fusion; joint search over structured pruning and distillation; and train-and-serve unification that collapses sparse, quantized, and low-rank parameters into inference-ready weights. The goal is a practical bridge from algorithmic compression to resource-aligned and reliable on-device and edge deployment.

Complete references for Sections 3–4 are hosted at https://github.com/LumosJiang/Awesome-On-Device-LLMs.
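To make the train-and-serve unification idea concrete, the minimal sketch below (not taken from the paper; all function names and hyperparameters such as merge_lora, quantize_per_channel_int8, alpha, and rank are illustrative assumptions) folds a low-rank adapter into its base weight and then applies symmetric per-output-channel INT8 quantization, producing a single inference-ready tensor plus per-channel scales.

```python
# Minimal sketch (illustrative, not from the surveyed systems): collapse a
# low-rank (LoRA-style) adapter into the dense weight, then quantize the merged
# matrix per output channel to INT8, yielding inference-ready parameters.
import numpy as np

def merge_lora(W, A, B, alpha, rank):
    """Fold a low-rank update into the dense weight: W' = W + (alpha / rank) * B @ A."""
    return W + (alpha / rank) * (B @ A)

def quantize_per_channel_int8(W):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix."""
    # One scale per output channel (row); guard against all-zero rows.
    scales = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Recover an FP32 approximation for accuracy checks."""
    return q.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in, rank, alpha = 256, 512, 8, 16.0
    W = rng.standard_normal((d_out, d_in)).astype(np.float32)
    A = rng.standard_normal((rank, d_in)).astype(np.float32) * 0.01
    B = rng.standard_normal((d_out, rank)).astype(np.float32) * 0.01

    W_merged = merge_lora(W, A, B, alpha, rank)      # adapter folded away at export time
    q, scales = quantize_per_channel_int8(W_merged)  # single INT8 tensor to ship on device

    err = np.abs(dequantize(q, scales) - W_merged).max()
    print(f"max per-channel dequantization error: {err:.4f}")
```

In this sketch the adapter disappears before deployment, so the runtime only sees one quantized weight matrix and its scales, which is one way to read the abstract's goal of collapsing sparse, quantized, and low-rank parameters into inference-ready weights.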
