FineRegion-LM: Enhancing Large Vision-Language Models for Fine-Grained Region-Level Understanding
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable success in vision-language tasks, yet they often fall short in fine-grained region-level understanding due to limited spatial sensitivity and insufficient region-specific annotations. To address these challenges, we propose FineRegion-LM, a generative model that enhances LVLMs' capabilities in region comprehension through a novel dual-stage framework. Our approach utilizes dynamic region masking to refine spatial focus and adaptive prompt-based learning for contextual generation. Extensive experiments on benchmark datasets demonstrate that FineRegion-LM significantly outperforms existing methods in region description, object classification, and spatial reasoning tasks. Human evaluations further confirm the effectiveness of our approach in generating accurate and contextually relevant descriptions.