Efficient Patch Pruning for Vision Transformers via Patch Similarity
Abstract
Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs) for visual recognition tasks, owing to their ability to model long-range dependencies in images through self-attention. However, the computational cost and memory consumption of self-attention scale quadratically with the number of input patches, making ViTs inefficient, especially for high-resolution images. In this work, we propose a simple yet effective patch pruning method based on patch similarity, aimed at improving the efficiency of ViTs without compromising their performance. The core idea is to selectively prune patches that exhibit high mutual similarity, reducing redundant computation while preserving crucial spatial and contextual information. First, we compute a similarity matrix between patches using a distance measure derived from their feature representations. Based on this measure, we identify clusters of highly similar patches and prune them in a manner that minimizes information loss. We show that pruning highly redundant patches yields a more compact representation while maintaining the overall performance of the ViT across various image classification tasks. We further explore how different similarity thresholds and pruning strategies affect model accuracy and computational efficiency. Experimental results on standard benchmark datasets such as ImageNet demonstrate that our patch pruning method achieves significant reductions in computation and memory usage with only a marginal decrease in accuracy. In addition, our approach offers flexibility in balancing the trade-off between speed and accuracy, making it a viable solution for deploying Vision Transformers on resource-constrained devices. Its simplicity and effectiveness make it a promising approach for improving the scalability and applicability of ViTs, particularly in real-world scenarios where efficiency is paramount.
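To make the pruning step concrete, below is a minimal PyTorch sketch of similarity-based patch pruning. It assumes patch embeddings taken from a ViT layer with the class token excluded; the cosine-similarity measure, the threshold `tau`, and the greedy keep-first-occurrence strategy are illustrative assumptions standing in for the clustering procedure described above, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def prune_similar_patches(patches: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Greedily drop patches that are too similar to an already-kept patch.

    patches: (N, D) tensor of N patch embeddings of dimension D
             (class token excluded).
    tau:     cosine-similarity threshold above which a patch is treated
             as redundant (an illustrative hyperparameter).
    Returns the indices of the patches to keep.
    """
    # Pairwise cosine similarity between all patches, shape (N, N).
    normed = F.normalize(patches, dim=-1)
    sim = normed @ normed.T

    keep: list[int] = []
    for i in range(patches.shape[0]):
        # Keep patch i only if no previously kept patch is too similar to it;
        # this retains roughly one representative per cluster of similar patches.
        if all(sim[i, j].item() < tau for j in keep):
            keep.append(i)
    return torch.tensor(keep, dtype=torch.long)

# Example: 196 patch tokens (a 14x14 grid) of dimension 384, as in a ViT-S/16.
x = torch.randn(196, 384)
kept = prune_similar_patches(x, tau=0.9)
compact_x = x[kept]  # shorter token sequence fed to the remaining blocks
```

Note that random embeddings, as in the example above, are rarely similar enough to be pruned; with real patch features, near-duplicate background patches are typically the ones removed. The threshold `tau` acts as the speed-accuracy knob the abstract refers to: a higher value keeps more patches and preserves accuracy, while a lower value prunes more aggressively and saves more computation.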