Fine-Grained Multimodal Alignment and Iterative Rectification Learning Framework
Abstract
Current multimodal models show strong general understanding across vision and language but often struggle with detailed visual grounding, complex reasoning, and spatial consistency. To address these challenges, we introduce a Fine-Grained Multimodal Alignment and Iterative Rectification Learning Framework (FGAM). The framework follows a two-stage paradigm. In the first stage, fine-grained cross-modal pre-training constructs region–text pairs and applies contrastive and spatial consistency objectives to strengthen precise visual–semantic alignment. In the second stage, iterative reasoning and rectification fine-tuning introduces a self-evaluation loop in which a rectification module reviews and refines model outputs based on visual evidence. Experiments on multiple multimodal backbones and benchmarks demonstrate that FGAM improves fine-grained reasoning and spatial understanding while reducing hallucinations. Ablation studies and human evaluations confirm the effectiveness of each component and the overall reliability of the framework.
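To make the two-stage description concrete, here is a minimal PyTorch-style sketch of the Stage-1 objectives as the abstract characterizes them: a contrastive loss over matched region–text pairs plus a spatial consistency term on region boxes. The function names, the temperature value, and the choice of an L1 box term are illustrative assumptions, not the paper's actual interface.

```python
# Sketch of the Stage-1 pre-training objectives (assumed, not the paper's code):
# symmetric region-phrase contrastive alignment + a spatial consistency term.
import torch
import torch.nn.functional as F


def region_text_contrastive(region_feats: torch.Tensor,
                            text_feats: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched region/phrase embeddings.

    region_feats, text_feats: (N, D) tensors where row i of each tensor
    belongs to the same region-text pair.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = region_feats @ text_feats.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_r2t = F.cross_entropy(logits, targets)               # region -> text direction
    loss_t2r = F.cross_entropy(logits.t(), targets)           # text -> region direction
    return 0.5 * (loss_r2t + loss_t2r)


def spatial_consistency(pred_boxes: torch.Tensor,
                        gt_boxes: torch.Tensor) -> torch.Tensor:
    """L1 box regression as a stand-in for the spatial consistency objective.

    Boxes are (N, 4) in normalized (cx, cy, w, h) format.
    """
    return F.l1_loss(pred_boxes, gt_boxes)


def stage1_loss(region_feats, text_feats, pred_boxes, gt_boxes,
                lambda_spatial: float = 1.0) -> torch.Tensor:
    # Total pre-training objective: alignment loss + weighted spatial term.
    return (region_text_contrastive(region_feats, text_feats)
            + lambda_spatial * spatial_consistency(pred_boxes, gt_boxes))
```

The Stage-2 self-evaluation loop can likewise be pictured as a bounded generate–check–revise cycle; the callables `generate`, `rectifier_score`, and `revise` below are placeholders for the model, the rectification module, and its refinement step, not components defined by the paper.

```python
# Hypothetical sketch of the Stage-2 iterative rectification loop.
def iterative_rectification(image, question, generate, rectifier_score, revise,
                            threshold: float = 0.5, max_rounds: int = 3):
    answer = generate(image, question)
    for _ in range(max_rounds):
        score = rectifier_score(image, question, answer)   # grounded self-evaluation
        if score >= threshold:                             # answer judged consistent with the image
            break
        answer = revise(image, question, answer)           # refine using visual evidence
    return answer
```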