Advancing Transformer Efficiency with Token Pruning
Abstract
Transformer-based models have revolutionized natural language processing (NLP), achieving state-of-the-art performance across a wide range of tasks. However, their high computational cost and memory requirements pose significant challenges for real-world deployment, particularly in resource-constrained environments. Token pruning has emerged as a promising technique to improve efficiency by selectively removing less informative tokens during inference, thereby reducing FLOPs and latency while maintaining competitive performance. This survey provides a comprehensive overview of token pruning methods, categorizing them into static, dynamic, and hybrid approaches. We discuss key pruning strategies, including attention-based pruning, entropy-based pruning, reinforcement learning methods, and differentiable token selection. Furthermore, we examine empirical studies that evaluate the trade-offs between efficiency gains and accuracy retention, highlighting the effectiveness of token pruning across a range of NLP benchmarks. Beyond theoretical advancements, we explore real-world applications of token pruning, including mobile NLP, large-scale language models, streaming applications, and multimodal AI systems. We also outline open research challenges, such as preserving model generalization, optimizing pruning for hardware acceleration, ensuring fairness, and developing automated, adaptive pruning strategies. As deep learning models continue to scale, token pruning represents a crucial step toward making AI systems more efficient and practical for widespread adoption. We conclude by identifying future research directions that can further enhance the effectiveness and applicability of token pruning techniques in modern AI deployments.
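To make the core idea concrete, the following is a minimal sketch of one of the strategies named above, attention-based pruning: each token is scored by how much attention it receives, and only the top-scoring fraction is retained for subsequent layers. The function name `prune_tokens`, the `keep_ratio` parameter, and the use of mean received attention as the importance score are illustrative assumptions, not a specific method from the surveyed literature.

```python
import numpy as np

def prune_tokens(hidden, attn, keep_ratio=0.5):
    """Attention-based token pruning sketch (illustrative, not a specific published method).

    hidden: (seq_len, d_model) token representations at some layer
    attn:   (num_heads, seq_len, seq_len) attention weights, rows summing to 1
    Returns the retained token representations and their original indices.
    """
    # Importance score: mean attention each token receives,
    # averaged over all heads and all query positions.
    importance = attn.mean(axis=(0, 1))            # shape: (seq_len,)
    k = max(1, int(len(importance) * keep_ratio))
    # Keep the top-k most-attended tokens, preserving original order.
    keep = np.sort(np.argsort(importance)[-k:])
    return hidden[keep], keep

# Usage on random data standing in for one layer's activations.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 8, 16, 4
hidden = rng.standard_normal((seq_len, d_model))
attn = rng.random((num_heads, seq_len, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)           # normalize attention rows
pruned, kept = prune_tokens(hidden, attn, keep_ratio=0.5)
```

With `keep_ratio=0.5`, half of the 8 tokens are dropped, so later layers operate on a 4-token sequence; since self-attention cost is quadratic in sequence length, this roughly quarters that layer's attention FLOPs, which is the efficiency lever the survey examines.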