Token-Level Pruning in Attention Models
Abstract
Transformer-based models have revolutionized natural language processing (NLP), achieving state-of-the-art performance across a wide range of tasks. However, their high computational cost and memory requirements pose significant challenges for real-world deployment, particularly in resource-constrained environments. Token pruning has emerged as an effective technique to enhance the efficiency of transformers by dynamically removing less informative tokens during inference, thereby reducing computational complexity while maintaining competitive accuracy. This survey provides a comprehensive review of token pruning methods, categorizing them into attention-based, gradient-based, reinforcement learning-based, and hybrid approaches. We analyze the theoretical foundations behind these techniques, discuss empirical evaluations across various NLP benchmarks, and explore their impact on model accuracy, efficiency, and generalization. Additionally, we examine practical considerations for implementing token pruning in real-world applications, including optimization strategies, hardware compatibility, and challenges related to dynamic execution. Despite the promising results achieved by token pruning, several open research questions remain, such as improving adaptability to different tasks, ensuring robustness under distribution shifts, and developing hardware-aware pruning techniques. We highlight these challenges and outline future research directions to advance the field. By consolidating existing knowledge and identifying key areas for innovation, this survey aims to provide valuable insights for researchers and practitioners seeking to optimize transformer-based models for efficiency without sacrificing performance.
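To make the core idea concrete, the following is a minimal sketch of the attention-based variant of token pruning described above: each token is scored by how much attention it receives (averaged over heads and query positions), and only the top-scoring fraction of tokens is carried forward to later layers. The function name `attention_prune`, the `keep_ratio` parameter, and the column-mean scoring rule are illustrative assumptions, not a specific method from the surveyed literature.

```python
import numpy as np

def attention_prune(hidden, attn, keep_ratio=0.5):
    """Score each token by the attention it receives (mean over heads
    and query positions) and keep only the top-scoring fraction.

    hidden: (seq, dim) token hidden states
    attn:   (heads, seq, seq) row-normalized attention weights
    """
    # Mean over heads -> (seq, seq); mean over query rows -> (seq,)
    importance = attn.mean(axis=0).mean(axis=0)
    k = max(1, int(len(importance) * keep_ratio))
    # Indices of the top-k tokens, re-sorted to preserve sequence order
    keep = np.sort(np.argsort(importance)[-k:])
    return hidden[keep], keep

# Toy example: 3 heads, 6 tokens, 4-dim hidden states
rng = np.random.default_rng(0)
attn = rng.random((3, 6, 6))
attn /= attn.sum(axis=-1, keepdims=True)  # normalize rows, as softmax would
hidden = rng.random((6, 4))
pruned, kept = attention_prune(hidden, attn, keep_ratio=0.5)
print(pruned.shape, kept)  # keeps 3 of the 6 tokens
```

In practice such a rule is applied layer by layer during inference, so the sequence length (and hence the quadratic attention cost) shrinks as computation proceeds; gradient-based and reinforcement-learning variants replace the attention-derived importance score with learned ones.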