Enhancing Large Vision-Language Models via Quantized Grounded Reasoning
Abstract
Large Vision-Language Models (LVLMs) have achieved strong results in general visual understanding but remain limited in fine-grained visual reasoning. This paper introduces LVLM-GR, a framework designed to improve detailed visual grounding and robust multimodal reasoning. The proposed Visual Concept Quantizer (VCQ) encodes images into discrete visual tokens through context-aware pooling and a semantic hierarchical codebook, preserving fine-grained semantics. These visual tokens are then aligned with language through a lightweight Grounded Reasoning Adapter (GRA), which applies LoRA-based tuning atop a frozen LLaVA 1.5 13B backbone. Experiments on GQA, RefCOCO+, and A-OKVQA show that LVLM-GR achieves superior performance in fine-grained visual understanding, reasoning, and grounding, highlighting its potential for complex multimodal reasoning tasks in material-level and detailed visual analysis.
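To make the quantization idea concrete, the following is a minimal, hypothetical PyTorch sketch of a VCQ-style module: patch features are mixed with a self-attention context (standing in for context-aware pooling) and each vector is snapped to its nearest entry in a flat codebook. The abstract does not specify the hierarchical codebook structure or the exact pooling operator, so the class name, dimensions, codebook size, and straight-through estimator below are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class VisualConceptQuantizerSketch(nn.Module):
    """Illustrative stand-in for a VCQ-style quantizer (not the paper's code):
    context-mix patch features, then map each one to a discrete codebook entry."""

    def __init__(self, feat_dim: int = 1024, codebook_size: int = 8192):
        super().__init__()
        # Assumed flat codebook; the paper describes a semantic *hierarchical* codebook.
        self.codebook = nn.Embedding(codebook_size, feat_dim)
        # Self-attention used here as a simple proxy for context-aware pooling.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats: torch.Tensor):
        # patch_feats: (batch, num_patches, feat_dim) from a frozen vision encoder.
        ctx, _ = self.attn(patch_feats, patch_feats, patch_feats)
        pooled = patch_feats + ctx  # residual context mixing

        # Nearest-neighbor lookup: each pooled vector -> index of the closest code.
        dists = torch.cdist(pooled, self.codebook.weight.unsqueeze(0))  # (B, N, K)
        token_ids = dists.argmin(dim=-1)                                # discrete visual tokens
        quantized = self.codebook(token_ids)                            # (B, N, feat_dim)

        # Straight-through estimator so gradients still reach the encoder side.
        quantized = pooled + (quantized - pooled).detach()
        return token_ids, quantized


# Usage: quantize ViT-style patch features into discrete visual token ids.
vcq = VisualConceptQuantizerSketch()
feats = torch.randn(2, 576, 1024)       # e.g. 24x24 patches per image
ids, codes = vcq(feats)
print(ids.shape, codes.shape)           # (2, 576) and (2, 576, 1024)
```

The resulting token ids would then be embedded and passed, alongside text tokens, through the GRA's LoRA-tuned projection into the frozen LLaVA 1.5 13B backbone; that alignment step is not shown here since the abstract gives no architectural detail beyond its use of LoRA.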