Attention Re-Alignment in Multimodal Large Language Models via Intermediate-Layer Guidance
Abstract
Multimodal large language models (MLLMs) have achieved impressive performance in understanding and describing visual content, setting new state-of-the-art results on a variety of visual question answering (VQA) benchmarks. However, during decoding, these models often fail to attend to fine-grained visual details in the input image. Our analysis of intermediate attention layers reveals that MLLMs are not inherently incapable of perceiving target objects; rather, attention to visual details becomes diluted in deeper layers due to the dominance of language priors. To address this limitation, we propose a plug-and-play Attention Re-Alignment (ARA) module that restores suppressed visual grounding. ARA conducts a layer-wise analysis of the relative attention distribution of image-centric attention heads and incorporates a confidence-aware layer selection mechanism based on attention peak and entropy, dynamically aggregating attention maps from the most informative layers. The aggregated maps then guide the generation of semantic masks that emphasize salient visual regions while suppressing irrelevant or noisy content. ARA can be seamlessly integrated into existing MLLMs and yields consistent improvements across multiple VQA benchmarks, validating its effectiveness in enhancing visual detail sensitivity.
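To make the confidence-aware layer selection concrete, the following is a minimal sketch, not the authors' implementation: it assumes per-layer attention maps over image tokens (averaged over image-centric heads and normalized) are already extracted, and the particular confidence score combining peak and entropy, the top-k cutoff, and all function and variable names are illustrative assumptions.

```python
import torch


def select_and_aggregate_attention(attn_maps: torch.Tensor, top_k: int = 3, eps: float = 1e-8) -> torch.Tensor:
    """Aggregate attention maps from the most confident layers.

    attn_maps: (num_layers, num_image_tokens), each row a normalized attention
    distribution over image tokens from image-centric heads of one layer.
    Returns a single aggregated attention map over image tokens.
    """
    # Peak score: how strongly a layer singles out its most-attended image token.
    peak = attn_maps.max(dim=-1).values                        # (num_layers,)

    # Entropy: lower entropy means the layer's attention is more concentrated.
    entropy = -(attn_maps * (attn_maps + eps).log()).sum(-1)   # (num_layers,)

    # Hypothetical confidence score: reward sharp peaks, penalize diffuse layers.
    confidence = peak / (entropy + eps)                        # (num_layers,)

    # Keep the top-k most confident layers and turn their scores into weights.
    scores, idx = confidence.topk(min(top_k, attn_maps.size(0)))
    weights = torch.softmax(scores, dim=0)                     # (top_k,)

    # Confidence-weighted aggregation of the selected layers' attention maps.
    aggregated = (weights[:, None] * attn_maps[idx]).sum(0)    # (num_image_tokens,)
    return aggregated / aggregated.sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy example: 32 layers, 16 image tokens, random normalized attention.
    maps = torch.softmax(torch.randn(32, 16), dim=-1)
    agg = select_and_aggregate_attention(maps)
    print(agg.shape)  # torch.Size([16])
```

In this sketch the aggregated map could then be thresholded into a semantic mask over image regions; how ARA actually constructs and applies that mask during decoding is specified in the paper, not here.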