Efficient Patch Pruning for Vision Transformers via Patch Similarity
Abstract
Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs) for visual recognition tasks, owing to their ability to model long-range dependencies in images through self-attention. However, the computational cost and memory consumption of self-attention scale quadratically with the number of input patches, making ViTs inefficient, especially for high-resolution images. In this work, we propose a simple yet effective patch pruning method based on patch similarity, aimed at improving the efficiency of ViTs without compromising their performance. The core idea is to selectively prune patches that exhibit high mutual similarity, reducing redundant computation while preserving crucial spatial and contextual information. First, we compute a similarity matrix between patches using a distance measure derived from their feature representations. Based on this measure, we identify clusters of highly similar patches and prune them in a manner that minimizes information loss. We show that pruning highly redundant patches yields a more compact representation while maintaining the overall performance of the ViT across various image classification tasks. We further explore how different similarity thresholds and pruning strategies affect model accuracy and computational efficiency. Experimental results on standard benchmark datasets such as ImageNet demonstrate that our patch pruning method achieves significant reductions in computation and memory usage with only a marginal decrease in accuracy. In addition, our approach offers flexibility in balancing the trade-off between speed and accuracy, making it a viable solution for deploying Vision Transformers on resource-constrained devices. Its simplicity and effectiveness make it a promising approach for improving the scalability and applicability of ViTs, particularly in real-world scenarios where efficiency is paramount.
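To make the pruning step concrete, below is a minimal PyTorch sketch of similarity-based patch pruning. It assumes patch embeddings taken from a ViT layer with the class token excluded; the cosine-similarity measure, the threshold `tau`, and the greedy keep-first-occurrence strategy are illustrative assumptions standing in for the clustering procedure described above, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def prune_similar_patches(patches: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Greedily drop patches that are too similar to an already-kept patch.

    patches: (N, D) tensor of N patch embeddings of dimension D
             (class token excluded).
    tau:     cosine-similarity threshold above which a patch is treated
             as redundant (an illustrative hyperparameter).
    Returns the indices of the patches to keep.
    """
    # Pairwise cosine similarity between all patches, shape (N, N).
    normed = F.normalize(patches, dim=-1)
    sim = normed @ normed.T

    keep: list[int] = []
    for i in range(patches.shape[0]):
        # Keep patch i only if no previously kept patch is too similar to it;
        # this retains roughly one representative per cluster of similar patches.
        if all(sim[i, j].item() < tau for j in keep):
            keep.append(i)
    return torch.tensor(keep, dtype=torch.long)

# Example: 196 patch tokens (a 14x14 grid) of dimension 384, as in a ViT-S/16.
x = torch.randn(196, 384)
kept = prune_similar_patches(x, tau=0.9)
compact_x = x[kept]  # shorter token sequence fed to the remaining blocks
```

Note that random embeddings, as in the example above, are rarely similar enough to be pruned; with real patch features, near-duplicate background patches are typically the ones removed. The threshold `tau` acts as the speed-accuracy knob the abstract refers to: a higher value keeps more patches and preserves accuracy, while a lower value prunes more aggressively and saves more computation.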