Attention Re-Alignment in Multimodal Large Language Models via Intermediate-Layer Guidance
Abstract
Multimodal large language models (MLLMs) have achieved impressive performance in understanding and describing visual content, setting new state-of-the-art results on a variety of visual question answering (VQA) benchmarks. However, during decoding, these models often fail to attend to fine-grained visual details in the input image. Our analysis of intermediate attention layers reveals that MLLMs are not inherently incapable of perceiving target objects; rather, attention to visual details becomes diluted in deeper layers due to the dominance of language priors. To address this limitation, we propose a plug-and-play Attention Re-Alignment (ARA) module that restores suppressed visual grounding. ARA conducts a layer-wise analysis of the relative attention distribution of image-centric attention heads and incorporates a confidence-aware layer selection mechanism based on attention peak and entropy, dynamically aggregating attention maps from the most informative layers. The aggregated maps then guide the generation of semantic masks that emphasize salient visual regions while suppressing irrelevant or noisy content. ARA can be seamlessly integrated into existing MLLMs and yields consistent improvements across multiple VQA benchmarks, validating its effectiveness in enhancing visual detail sensitivity.
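To make the confidence-aware layer selection concrete, the following is a minimal sketch, not the authors' implementation: it assumes per-layer attention maps over image tokens (averaged over image-centric heads and normalized) are already extracted, and the particular confidence score combining peak and entropy, the top-k cutoff, and all function and variable names are illustrative assumptions.

```python
import torch


def select_and_aggregate_attention(attn_maps: torch.Tensor, top_k: int = 3, eps: float = 1e-8) -> torch.Tensor:
    """Aggregate attention maps from the most confident layers.

    attn_maps: (num_layers, num_image_tokens), each row a normalized attention
    distribution over image tokens from image-centric heads of one layer.
    Returns a single aggregated attention map over image tokens.
    """
    # Peak score: how strongly a layer singles out its most-attended image token.
    peak = attn_maps.max(dim=-1).values                        # (num_layers,)

    # Entropy: lower entropy means the layer's attention is more concentrated.
    entropy = -(attn_maps * (attn_maps + eps).log()).sum(-1)   # (num_layers,)

    # Hypothetical confidence score: reward sharp peaks, penalize diffuse layers.
    confidence = peak / (entropy + eps)                        # (num_layers,)

    # Keep the top-k most confident layers and turn their scores into weights.
    scores, idx = confidence.topk(min(top_k, attn_maps.size(0)))
    weights = torch.softmax(scores, dim=0)                     # (top_k,)

    # Confidence-weighted aggregation of the selected layers' attention maps.
    aggregated = (weights[:, None] * attn_maps[idx]).sum(0)    # (num_image_tokens,)
    return aggregated / aggregated.sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy example: 32 layers, 16 image tokens, random normalized attention.
    maps = torch.softmax(torch.randn(32, 16), dim=-1)
    agg = select_and_aggregate_attention(maps)
    print(agg.shape)  # torch.Size([16])
```

In this sketch the aggregated map could then be thresholded into a semantic mask over image regions; how ARA actually constructs and applies that mask during decoding is specified in the paper, not here.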