Advancing Transformer Efficiency with Token Pruning
Abstract
Transformer-based models have revolutionized natural language processing (NLP), achieving state-of-the-art performance across a wide range of tasks. However, their high computational cost and memory requirements pose significant challenges for real-world deployment, particularly in resource-constrained environments. Token pruning has emerged as a promising technique to improve efficiency by selectively removing less informative tokens during inference, thereby reducing FLOPs and latency while maintaining competitive performance. This survey provides a comprehensive overview of token pruning methods, categorizing them into static, dynamic, and hybrid approaches. We discuss key pruning strategies, including attention-based pruning, entropy-based pruning, reinforcement learning methods, and differentiable token selection. Furthermore, we examine empirical studies that evaluate the trade-offs between efficiency gains and accuracy retention, highlighting the effectiveness of token pruning across a range of NLP benchmarks. Beyond theoretical advancements, we explore real-world applications of token pruning, including mobile NLP, large-scale language models, streaming applications, and multimodal AI systems. We also outline open research challenges, such as preserving model generalization, optimizing pruning for hardware acceleration, ensuring fairness, and developing automated, adaptive pruning strategies. As deep learning models continue to scale, token pruning represents a crucial step toward making AI systems more efficient and practical for widespread adoption. We conclude by identifying future research directions that can further enhance the effectiveness and applicability of token pruning techniques in modern AI deployments.
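To make the core idea concrete, the following is a minimal sketch of one of the strategies named above, attention-based pruning: each token is scored by how much attention it receives, and only the top-scoring fraction is retained for subsequent layers. The function name `prune_tokens`, the `keep_ratio` parameter, and the use of mean received attention as the importance score are illustrative assumptions, not a specific method from the surveyed literature.

```python
import numpy as np

def prune_tokens(hidden, attn, keep_ratio=0.5):
    """Attention-based token pruning sketch (illustrative, not a specific published method).

    hidden: (seq_len, d_model) token representations at some layer
    attn:   (num_heads, seq_len, seq_len) attention weights, rows summing to 1
    Returns the retained token representations and their original indices.
    """
    # Importance score: mean attention each token receives,
    # averaged over all heads and all query positions.
    importance = attn.mean(axis=(0, 1))            # shape: (seq_len,)
    k = max(1, int(len(importance) * keep_ratio))
    # Keep the top-k most-attended tokens, preserving original order.
    keep = np.sort(np.argsort(importance)[-k:])
    return hidden[keep], keep

# Usage on random data standing in for one layer's activations.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 8, 16, 4
hidden = rng.standard_normal((seq_len, d_model))
attn = rng.random((num_heads, seq_len, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)           # normalize attention rows
pruned, kept = prune_tokens(hidden, attn, keep_ratio=0.5)
```

With `keep_ratio=0.5`, half of the 8 tokens are dropped, so later layers operate on a 4-token sequence; since self-attention cost is quadratic in sequence length, this roughly quarters that layer's attention FLOPs, which is the efficiency lever the survey examines.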