Beyond References: Human-Aligned Caption Reliability Assessment

Abstract

Despite the rapid progress of modern image captioning systems, the reliability of generated captions in practical deployments often lags behind expectations. In critical scenarios such as assistive technologies or human-AI interaction platforms, unreliable descriptions may undermine user trust and lead to serious usability issues. To address this gap, we investigate the task of Caption Quality Estimation (CQE) without references, where the objective is to judge the appropriateness of a caption directly against its paired image. This paradigm allows untrustworthy outputs to be filtered at inference time, offering a proactive safeguard for real-world captioning applications. We introduce VQAR, a novel reference-free framework explicitly designed to approximate human perception of caption adequacy. Central to this framework is a large-scale dataset we collected, containing over 600,000 binary human judgments across roughly 55,000 $\langle \text{image}, \text{caption} \rangle$ pairs drawn from 16,000 diverse images. Each annotation acts as a binary signal of visual-semantic compatibility, capturing whether humans deem a caption acceptable for its associated image. To demonstrate both reliability and scalability, we validate the dataset through consistency analyses and benchmark several CQE models on it. Moreover, we supplement the coarse binary annotations with a subset of fine-grained expert evaluations, enabling us to assess how well learned models generalize. Experimental results show that models trained exclusively on coarse judgments can nonetheless approximate nuanced human preferences, underlining the practicality of VQAR for large-scale deployment. Our contributions are threefold: (i) establishing a reference-independent framework for caption validation; (ii) curating a high-coverage dataset with over 600k human annotations; (iii) providing empirical benchmarks that highlight the difficulty and distinctiveness of CQE compared with conventional captioning and retrieval tasks. By removing the dependence on reference captions, VQAR offers a robust, human-centric path toward improving the trustworthiness of captioning systems in interactive and mission-critical environments.
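
To make the CQE setup concrete, the following minimal sketch shows how a reference-free quality estimator could be trained on binary acceptability judgments of the kind described above. This is not the paper's implementation: the `CQEHead` module, the embedding dimensions, and the random placeholder features are illustrative assumptions; in practice the image and caption embeddings would come from a frozen vision-language encoder.

```python
# Minimal sketch (assumptions, not the authors' method): a reference-free
# caption quality estimator trained on binary human judgments. Embeddings
# are stand-in random tensors so the example runs end to end.
import torch
import torch.nn as nn

class CQEHead(nn.Module):
    """Predicts P(caption is acceptable | image) from paired embeddings."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img_emb, txt_emb):
        # Concatenate image and caption features, output a single logit.
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

# Toy batch: 8 <image, caption> pairs with binary acceptability labels.
img_emb = torch.randn(8, 512)                # placeholder image embeddings
txt_emb = torch.randn(8, 512)                # placeholder caption embeddings
labels = torch.randint(0, 2, (8,)).float()   # 1 = acceptable, 0 = not

model = CQEHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()           # matches the binary judgment signal

# One training step.
optimizer.zero_grad()
loss = criterion(model(img_emb, txt_emb), labels)
loss.backward()
optimizer.step()

# At inference, captions whose predicted acceptability falls below a chosen
# threshold can be filtered before being shown to users.
with torch.no_grad():
    scores = torch.sigmoid(model(img_emb, txt_emb))
    keep = scores > 0.5
```

Training against binary labels mirrors the coarse judgments collected in the dataset; a subset of fine-grained expert scores, as mentioned above, could serve as an evaluation target for the same model without changing this overall setup.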
