Efficient Layer-wise Attribution Method for Scalable Explainability in VLMs

Abstract

Large vision-language models (VLMs) with billions of parameters are increasingly deployed in high-stakes applications such as medical diagnosis, autonomous driving, and content moderation, yet their decision-making processes remain opaque. Existing explainable AI (XAI) methods face severe computational bottlenecks when applied to these large-scale models, with explanation generation times exceeding 45 seconds per sample on standard hardware, limiting their practical utility in real-world scenarios. This study addresses this critical gap by proposing the Efficient Layer-wise Attribution Method (ELAM), a novel scalable XAI approach that leverages layer-wise gradient approximation and selective attention mechanism analysis to generate faithful explanations with significantly reduced computational overhead. We evaluate ELAM on three state-of-the-art VLMs spanning three orders of magnitude in model size: CLIP-ViT-L/14 (428M parameters), BLIP-2 (2.7B parameters), and LLaVA-1.5-7B (7B parameters), across 2,500 image-text pairs from the MS-COCO and Flickr30k datasets. Experimental results demonstrate that ELAM achieves an 87.3% computational efficiency improvement (7.8× to 11.9× speedup) over gradient-based baselines while maintaining 94.2% explanation fidelity as measured by insertion-deletion scores. Furthermore, ELAM successfully scales to models with up to 7 billion parameters, reducing explanation generation time from 45.2 seconds to 3.8 seconds per sample on standard hardware. Our method provides a practical solution for deploying transparent and accountable VLMs in critical domains where both accuracy and interpretability are essential.
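The insertion-deletion scores used above to measure explanation fidelity can be sketched as follows. This is a minimal illustration of the standard deletion-curve idea, not the paper's implementation: the `predict` interface, the zero-pixel baseline, and the use of the curve's mean as an area proxy are all assumptions for the sake of the example.

```python
import numpy as np

def deletion_auc(predict, image, attribution, steps=10):
    """Deletion curve for a saliency map: progressively zero out the
    highest-attribution pixels and track the model's confidence.
    A faithful attribution makes confidence drop quickly, giving a
    lower area under the curve. `predict` is any callable mapping an
    image array to a scalar confidence (hypothetical interface)."""
    # Rank pixels by attribution, most important first.
    order = np.argsort(attribution.ravel())[::-1]
    flat = image.ravel().astype(float).copy()
    n = flat.size
    scores = [predict(flat.reshape(image.shape))]
    for k in range(1, steps + 1):
        # Zero out the next chunk of top-ranked pixels.
        cut = order[(k - 1) * n // steps : k * n // steps]
        flat[cut] = 0.0
        scores.append(predict(flat.reshape(image.shape)))
    # Mean of the confidence curve, proportional to its area
    # (lower = more faithful deletion behaviour).
    return float(np.mean(scores))
```

The insertion score is the mirror image: start from a fully blanked image, restore the highest-attribution pixels first, and reward explanations whose curve rises quickly.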
