Token Pruning for Efficient NLP, Vision, and Speech Models

Abstract

The rapid growth of Transformer-based architectures has led to significant advancements in natural language processing (NLP), computer vision, and speech processing. However, their increasing computational demands pose challenges for real-time inference, edge deployment, and energy efficiency. Token pruning has emerged as a promising solution to mitigate these issues by dynamically reducing sequence lengths during model execution while preserving task performance. This survey provides a comprehensive review of token pruning techniques, categorizing them based on their methodologies, such as static vs. dynamic pruning, early exit strategies, and adaptive token selection. We explore their effectiveness across various domains, including text classification, machine translation, object detection, and speech recognition. Additionally, we discuss the trade-offs between efficiency and accuracy, challenges in generalization, and the integration of token pruning with other model compression techniques. Finally, we outline future research directions, emphasizing self-supervised token selection, multimodal pruning, and hardware-aware optimization. By consolidating recent advancements, this survey aims to serve as a foundational reference for researchers and practitioners seeking to enhance the efficiency of deep learning models through token pruning.
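To make the core idea concrete: adaptive token selection methods typically score each token's importance (often from attention weights) and drop the lowest-scoring tokens mid-inference, shortening the sequence the remaining layers must process. The sketch below is a minimal, illustrative implementation of score-based top-k pruning, not any specific surveyed method; the function name, the use of mean attention scores, and the keep ratio are all assumptions for demonstration.

```python
import numpy as np

def prune_tokens(hidden_states, attention_scores, keep_ratio=0.5):
    """Illustrative score-based token pruning.

    Keeps the top `keep_ratio` fraction of tokens ranked by an
    importance score (here assumed to be a per-token attention score),
    preserving the original token order of the survivors.
    """
    seq_len = hidden_states.shape[0]
    num_keep = max(1, int(seq_len * keep_ratio))
    # Rank tokens by importance, then re-sort the kept indices so the
    # pruned sequence retains its original positional order.
    top = np.argsort(attention_scores)[-num_keep:]
    keep_idx = np.sort(top)
    return hidden_states[keep_idx], keep_idx

# Example: 8 tokens with 4-dim hidden states, pruned to 4 tokens.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 4))
scores = np.array([0.9, 0.1, 0.4, 0.8, 0.05, 0.7, 0.2, 0.6])
pruned, kept = prune_tokens(hidden, scores, keep_ratio=0.5)
```

In a real dynamic-pruning pipeline this step would run between Transformer layers, so each subsequent layer's attention cost (quadratic in sequence length) drops as tokens are removed.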
