Optimal Complexity in Lightweight Vision Transformers: A Trade-off Analysis between Representational Power and Optimization Efficiency
Abstract
The deployment of deep learning models on resource-constrained edge devices requires a careful balance between performance and complexity. This study systematically challenges the prevailing assumption that augmenting lightweight vision transformers with sophisticated modules invariably improves performance. Investigating the impact of structural enhancements to the state-of-the-art lightweight Vision Transformer RepViT-M0.9, our experiments on ImageNet-1K reveal that increasing structural complexity can significantly degrade both accuracy and parameter efficiency. Visualizations and feature-space analysis suggest that excessive complexity in a lightweight model impairs feature representations and introduces optimization challenges. We propose the Representation-Optimization Trade-off Theory, which models performance as a balance between representational power and optimization cost. Our findings demonstrate that an optimal complexity level exists for lightweight models, beyond which performance deteriorates. This work highlights the importance of structural simplicity and parameter efficiency in developing effective AI solutions for edge devices. The source code and pre-trained models are available at https://github.com/niyaobuyaochibl/ACR-RepViT (DOI: 10.5281/zenodo.16959886).
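To make the trade-off concrete, a minimal sketch under our own assumptions (the abstract does not specify a functional form; the symbols $P$, $R$, $O$, and $c$ are hypothetical) is

$$P(c) = R(c) - O(c), \qquad R'(c) > 0,\; R''(c) < 0, \qquad O'(c) > 0,\; O''(c) \ge 0,$$

where $c$ denotes structural complexity, $R(c)$ the representational power (growing with diminishing returns), and $O(c)$ the optimization cost. Under these assumptions the optimal complexity $c^{*} = \arg\max_{c} P(c)$ satisfies $R'(c^{*}) = O'(c^{*})$; beyond $c^{*}$, the marginal optimization cost exceeds the marginal representational gain, and performance deteriorates.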