Efficient Layer-wise Attribution Method for Scalable Explainability in VLMs
Abstract
Large vision-language models (VLMs) with billions of parameters are increasingly deployed in high-stakes applications such as medical diagnosis, autonomous driving, and content moderation, yet their decision-making processes remain opaque. Existing explainable AI (XAI) methods face severe computational bottlenecks when applied to these large-scale models, with explanation generation times exceeding 45 seconds per sample on standard hardware, limiting their practical utility in real-world scenarios. This study addresses this critical gap by proposing the Efficient Layer-wise Attribution Method (ELAM), a novel scalable XAI approach that leverages layer-wise gradient approximation and selective attention mechanism analysis to generate faithful explanations with significantly reduced computational overhead. We evaluate ELAM on three state-of-the-art VLMs spanning more than an order of magnitude in model size: CLIP-ViT-L/14 (428M parameters), BLIP-2 (2.7B parameters), and LLaVA-1.5-7B (7B parameters), across 2,500 image-text pairs from the MS-COCO and Flickr30k datasets. Experimental results demonstrate that ELAM achieves an 87.3% computational efficiency improvement (7.8× to 11.9× speedup) over gradient-based baselines while maintaining 94.2% explanation fidelity as measured by insertion-deletion scores. Furthermore, ELAM successfully scales to models with up to 7 billion parameters, reducing explanation generation time from 45.2 seconds to 3.8 seconds per sample on standard hardware. Our method provides a practical solution for deploying transparent and accountable VLMs in critical domains where both accuracy and interpretability are essential.
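The abstract does not specify how the insertion-deletion fidelity scores were computed, but the metric itself is standard: pixels are ranked by the attribution map, then progressively inserted into (or deleted from) the input while the model's score is tracked, and fidelity is the area under that curve. The sketch below is a generic, model-agnostic implementation of that metric, not ELAM itself; `model_fn` is a hypothetical callable returning a scalar confidence for an image.

```python
import numpy as np

def insertion_deletion_auc(model_fn, image, saliency,
                           steps=20, mode="deletion", baseline=0.0):
    """Generic insertion-deletion fidelity metric (sketch).

    Pixels are ranked by `saliency` (most salient first). In "deletion"
    mode they are progressively replaced by `baseline`; in "insertion"
    mode they are progressively revealed on a baseline image. The model
    score is recorded at each step and the trapezoidal area under the
    resulting curve over [0, 1] is returned. A faithful attribution map
    yields a high insertion AUC and a low deletion AUC.
    """
    order = np.argsort(saliency.ravel())[::-1]  # most salient pixels first
    n = order.size
    current = image.copy() if mode == "deletion" else np.full_like(image, baseline)
    scores = [model_fn(current)]
    for k in range(1, steps + 1):
        # Next chunk of pixel indices to flip at this step.
        idx = order[(k - 1) * n // steps : k * n // steps]
        src = baseline if mode == "deletion" else image.ravel()[idx]
        current.ravel()[idx] = src
        scores.append(model_fn(current))
    # Trapezoidal rule with uniform spacing 1/steps over [0, 1].
    return (0.5 * (scores[0] + scores[-1]) + sum(scores[1:-1])) / steps
```

With a toy model that scores an image by its mean intensity and a saliency map equal to the image itself, the insertion curve rises quickly and the deletion curve falls quickly, so the insertion AUC exceeds the deletion AUC, matching the intuition behind the 94.2% fidelity figure reported above.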