Reducing Computational Complexity in Vision Transformers Using Patch Slimming
Abstract
Vision Transformers (ViTs) have emerged as a dominant class of deep learning models for image recognition tasks, demonstrating superior performance compared to traditional Convolutional Neural Networks (CNNs) across various benchmark datasets. However, the computational complexity and memory consumption associated with ViTs remain significant challenges, particularly when applied to large-scale datasets or deployed in resource-constrained environments. One of the key contributors to this inefficiency is the patch-based approach utilized by ViTs, where images are divided into fixed-size patches and each patch is treated as an independent token. This results in a large number of tokens and thus a substantial computational burden in both the attention mechanism and the subsequent layers of the model.

In recent years, several strategies have been proposed to mitigate the inefficiencies introduced by the patching mechanism, collectively referred to as Patch Slimming techniques. These techniques aim to reduce the number of patches or tokens, through selective patch pruning, token aggregation, or dynamic patch selection, while maintaining or even improving the model's performance. The idea behind Patch Slimming is to reduce the amount of redundant information processed by the model, enhance computational efficiency, and decrease memory overhead, without compromising the model's capacity to capture meaningful features in the input image.

This survey presents a comprehensive review of the state-of-the-art Patch Slimming techniques for Vision Transformers. We begin by providing a brief overview of Vision Transformers and their inherent inefficiencies, followed by an in-depth discussion of various Patch Slimming methods, including token pruning, patch aggregation, attention-based patch selection, and hybrid approaches that combine multiple strategies.
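To make the core idea concrete, the following is a minimal sketch of attention-based token pruning, one of the Patch Slimming strategies discussed above: patch tokens are ranked by the attention weight the [CLS] token assigns to them, and only the top fraction is kept for subsequent layers. The function name, shapes, and keep ratio are illustrative assumptions, not a specific published method.

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, cls_attn: np.ndarray,
                 keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the most informative patch tokens, ranked by the attention
    the [CLS] token pays to each patch (a common pruning criterion).

    tokens:   (num_patches, dim) patch embeddings after a transformer block
    cls_attn: (num_patches,) attention weights from the [CLS] token
    """
    num_keep = max(1, int(round(tokens.shape[0] * keep_ratio)))
    # Indices of the highest-scoring patches, restored to their
    # original spatial order before indexing.
    top = np.sort(np.argsort(cls_attn)[::-1][:num_keep])
    return tokens[top]

# Toy example: 8 patch tokens of dimension 4, half of them pruned,
# so later layers attend over 4 tokens instead of 8.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
cls_attn = rng.random(8)
slim = prune_tokens(tokens, cls_attn, keep_ratio=0.5)
print(slim.shape)  # (4, 4)
```

Because self-attention cost grows quadratically with the token count, halving the tokens at an intermediate layer roughly quarters the attention cost of every layer after the pruning point.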
For each method, we examine the underlying principles, implementation details, advantages, and limitations, as well as the trade-offs involved in adopting these techniques for different types of vision tasks. Additionally, we present a detailed analysis of the impact of Patch Slimming on model accuracy, computational cost, and memory consumption, supported by empirical results from recent research.

Furthermore, we explore the integration of Patch Slimming with other optimization techniques, such as knowledge distillation, model quantization, and hardware-aware design, to further enhance the efficiency of ViTs. We also provide insights into future directions for research in this area, highlighting promising avenues such as adaptive patch selection, transformer model compression, and the use of advanced neural architecture search algorithms for efficient patch representation. Finally, we discuss the challenges and open questions in the field, including the trade-offs between accuracy and efficiency, the potential for real-time deployment, and the generalization of Patch Slimming techniques across diverse vision tasks.

In summary, this survey serves as a valuable resource for researchers and practitioners interested in improving the efficiency of Vision Transformers. By providing a thorough review of the existing Patch Slimming methods, their applications, and future research directions, we aim to contribute to the ongoing efforts to make Vision Transformers more accessible and practical for real-world applications, particularly in scenarios where computational resources are limited.