Strongly Connected Components Are All You Need: Graph-Theoretic Interpretability and Optimization for Vision Transformers
Abstract
Vision Transformers have revolutionized computer vision with unprecedented performance, yet their attention mechanisms remain largely opaque, hindering both trust and optimization in critical applications [1, 2]. Despite significant advances in transformer interpretability, existing approaches provide limited structural insights into how attention patterns encode visual information, creating a fundamental barrier to understanding these powerful models [3, 4]. We address this challenge by introducing a novel, comprehensive graph-theoretic framework for analyzing Vision Transformer attention mechanisms. Our approach treats attention matrices as directed graphs and applies Tarjan’s Strongly Connected Components algorithm to reveal hidden structural patterns within attention flows [5]. Through systematic evaluation of 22 pre-trained models across ViT, DeiT, and CLIP architectures, we discover that Strongly Connected Components with meaningful connectivity directly correspond to semantically coherent visual features—providing a mathematical foundation for understanding attention-based feature learning [6]. This insight enables a novel optimization strategy: layers containing minimal meaningful connectivity can be selectively ablated through attention linearization, achieving 16-30% inference speedup with only 0-7% accuracy loss for large models. Our findings demonstrate that ViT models exhibit superior structural organization (92.6% effectiveness) compared to DeiT (51.1%) and CLIP (57.2%) variants, while patch32 configurations achieve optimal performance tradeoffs [7, 8]. Beyond interpretability, this work establishes graph theory as a powerful lens for transformer analysis and provides practical tools for deploying efficient vision models without sacrificing semantic understanding [9].
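The core construction described above (attention matrix treated as a directed graph, then decomposed into strongly connected components) can be illustrated with a minimal sketch. The threshold tau, the single-head input shape, and the use of scipy's connected_components with connection='strong' in place of a hand-written Tarjan implementation are illustrative assumptions, not the paper's exact procedure.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def attention_sccs(attn: np.ndarray, tau: float = 0.05):
    """Treat a single-head attention matrix (tokens x tokens) as a directed
    graph and label each token with its strongly connected component.
    An edge i -> j is kept only where the attention weight exceeds tau."""
    adjacency = csr_matrix((attn > tau).astype(np.int8))
    n_components, labels = connected_components(
        adjacency, directed=True, connection='strong')
    return n_components, labels

# Toy usage: a random row-stochastic "attention" matrix over 16 patch tokens.
rng = np.random.default_rng(0)
attn = rng.random((16, 16))
attn /= attn.sum(axis=1, keepdims=True)
n_sccs, labels = attention_sccs(attn, tau=0.05)
print(f"{n_sccs} strongly connected components; labels: {labels}")

Components whose labels group multiple tokens would, under the abstract's claim, correspond to semantically coherent patch clusters; layers where nearly every token falls into its own trivial component are the candidates for the attention-linearization ablation described above.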