ViT-CAAC: Contribution-Aware Adaptive Compression Framework for Vision Transformers
Abstract
The Vision Transformer (ViT) has emerged as a powerful architecture for visual tasks, capturing long-range dependencies within images and demonstrating superior performance across a variety of applications. However, the large parameter count and the high computational and memory demands of ViTs pose significant challenges for deployment. This paper introduces ViT-CAAC (Contribution-Aware Adaptive Compression), a novel multi-faceted compression framework designed to optimize ViTs. Our framework integrates block-level knowledge distillation, layer-wise quantization with precision control across hierarchical layers, and adaptive sparsity, creating a cohesive approach that substantially reduces model size while preserving performance. Through rigorous experimentation on benchmark datasets, we demonstrate that our framework achieves over 76% reduction in model size with minimal accuracy degradation (less than 0.4% Top-1 accuracy loss). This work establishes a practical approach for deploying high-performance vision models on resource-limited devices, with implications for autonomous systems, IoT, and real-time vision processing.
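To make the three components named in the abstract concrete, the sketch below illustrates, in simplified PyTorch, how block-level distillation, layer-wise (per-layer bit-width) quantization, and adaptive sparsity might be combined. This is not the authors' implementation; all function names, the 8-bit setting, and the 0.5 keep ratio are assumptions chosen for demonstration only.

```python
# Illustrative sketch only (not the paper's code): simplified versions of the
# three compression ideas named in the abstract. Bit-widths and keep ratios
# are hypothetical placeholders.
import torch
import torch.nn.functional as F

def block_distillation_loss(student_feats, teacher_feats):
    """MSE between corresponding intermediate block outputs of student and teacher ViTs."""
    return sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))

def fake_quantize(weight, num_bits):
    """Uniform symmetric fake quantization; num_bits can differ per layer."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    return torch.round(weight / scale).clamp(-qmax, qmax) * scale

def adaptive_sparsity_mask(weight, keep_ratio):
    """Keep the largest-magnitude weights; keep_ratio could vary per block."""
    k = max(1, int(keep_ratio * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

# Example: compress a single linear layer with hypothetical settings.
layer = torch.nn.Linear(768, 768)
with torch.no_grad():
    mask = adaptive_sparsity_mask(layer.weight, keep_ratio=0.5)
    layer.weight.mul_(mask)                                       # prune low-magnitude weights
    layer.weight.copy_(fake_quantize(layer.weight, num_bits=8))   # simulate 8-bit precision
```

In a full pipeline along these lines, the distillation loss would be added to the task loss during fine-tuning of the compressed student, while per-layer bit-widths and sparsity ratios would be chosen according to each layer's measured contribution.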