Enhancing Caption Fidelity via Explanation-Guided Captioning with Vision-Language Fine-Tuning

Abstract

Image captioning models have achieved remarkable progress with the introduction of attention mechanisms and transformer-based architectures. However, understanding and diagnosing their predictions remains challenging, particularly with respect to attribution, interpretability, and the mitigation of hallucinated outputs. In this work, we present CAPEV, a novel explanation-guided fine-tuning paradigm that builds upon Layer-wise Relevance Propagation (LRP) to improve caption reliability and semantic grounding. We begin by systematically adapting state-of-the-art explanation methods, including LRP, Grad-CAM, and Guided Grad-CAM, to image captioning architectures with both adaptive and multi-head attention mechanisms. Unlike conventional attention heatmaps, which offer only a coarse visual explanation, these gradient-based and propagation-based methods provide dual-perspective relevance: spatial pixel-level attributions over image regions and token-wise linguistic relevance over the sequential input. Through rigorous comparisons, we find that these methods yield a more precise and disentangled view of the model's decision basis. Building on these insights, we introduce the CAPEV fine-tuning procedure, which leverages explanation signals computed at inference time to recalibrate the internal representations of the model. By identifying both supporting and opposing relevance cues for each word prediction, CAPEV dynamically adjusts context features to suppress hallucinated entities and reinforce grounded content. Notably, CAPEV operates without requiring additional external annotations or human supervision. Extensive experiments on the Flickr30K and MSCOCO benchmarks demonstrate that CAPEV significantly reduces object hallucination while preserving caption fluency and overall performance on standard evaluation metrics. Our findings suggest that integrating explainability into the training loop opens a promising avenue toward transparent and trustworthy vision-language generation.
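
The abstract describes the relevance-guided recalibration only at a high level. The minimal sketch below illustrates one plausible form of such a mechanism; it is not the authors' implementation. The function name capev_reweight, the gating rule 1 + alpha * relevance, and the assumption of per-feature signed LRP relevance scores are all hypothetical choices made for illustration.

```python
import torch

def capev_reweight(context: torch.Tensor,
                   relevance: torch.Tensor,
                   alpha: float = 1.0,
                   eps: float = 1e-8) -> torch.Tensor:
    """Recalibrate decoder context features using explanation-derived relevance.

    context:   (batch, dim) context vector feeding the word predictor.
    relevance: (batch, dim) signed relevance per feature (e.g. from LRP);
               positive values support the predicted word, negative values oppose it.
    """
    # Normalize relevance to [-1, 1] so the gate stays within [1 - alpha, 1 + alpha].
    rel = relevance / (relevance.abs().amax(dim=-1, keepdim=True) + eps)
    gate = 1.0 + alpha * rel
    # Up-weight supporting features, down-weight opposing ones.
    return context * gate
```

Under this sketch, the reweighted context would replace the original context vector before the vocabulary softmax during fine-tuning, while the standard cross-entropy objective is left unchanged, which is consistent with the abstract's claim that no additional annotations or human supervision are required.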
