Beyond References: Human-Aligned Caption Reliability Assessment

Abstract

Despite the rapid progress of modern image captioning systems, the reliability of generated captions in practical deployments often lags behind expectations. In critical scenarios such as assistive technologies or human-AI interaction platforms, unreliable descriptions may undermine user trust and lead to serious usability issues. To address this gap, we investigate the task of Caption Quality Estimation (CQE) without references, where the objective is to judge the appropriateness of a caption directly against its paired image. This paradigm allows untrustworthy outputs to be filtered at inference time, offering a proactive safeguard for real-world captioning applications. We introduce VQAR, a novel reference-free framework explicitly designed to approximate human perception of caption adequacy. Central to this framework is a large-scale dataset we collected, containing over 600,000 binary human judgments across roughly 55,000 $\langle \text{image}, \text{caption} \rangle$ pairs drawn from 16,000 diverse images. Each annotation acts as a binary signal of visual-semantic compatibility, capturing whether humans deem a caption acceptable for its associated image. To demonstrate both reliability and scalability, we validate the dataset through consistency analyses and benchmark several CQE models on it. Moreover, we supplement the coarse binary annotations with a subset of fine-grained expert evaluations, enabling us to assess how well learned models generalize. Experimental results show that models trained exclusively on coarse judgments can nonetheless approximate nuanced human preferences, underlining the practicality of VQAR for large-scale deployment. Our contributions are threefold: (i) establishing a reference-independent framework for caption validation; (ii) curating a high-coverage dataset with over 600k human annotations; (iii) providing empirical benchmarks that highlight the difficulty and distinctiveness of CQE compared with conventional captioning and retrieval tasks. By removing the dependence on reference captions, VQAR offers a robust, human-centric path toward improving the trustworthiness of captioning systems in interactive and mission-critical environments.
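
To make the CQE setup concrete, the following minimal sketch shows how a reference-free quality estimator could be trained on binary acceptability judgments of the kind described above. This is not the paper's implementation: the `CQEHead` module, the embedding dimensions, and the random placeholder features are illustrative assumptions; in practice the image and caption embeddings would come from a frozen vision-language encoder.

```python
# Minimal sketch (assumptions, not the authors' method): a reference-free
# caption quality estimator trained on binary human judgments. Embeddings
# are stand-in random tensors so the example runs end to end.
import torch
import torch.nn as nn

class CQEHead(nn.Module):
    """Predicts P(caption is acceptable | image) from paired embeddings."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img_emb, txt_emb):
        # Concatenate image and caption features, output a single logit.
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

# Toy batch: 8 <image, caption> pairs with binary acceptability labels.
img_emb = torch.randn(8, 512)                # placeholder image embeddings
txt_emb = torch.randn(8, 512)                # placeholder caption embeddings
labels = torch.randint(0, 2, (8,)).float()   # 1 = acceptable, 0 = not

model = CQEHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()           # matches the binary judgment signal

# One training step.
optimizer.zero_grad()
loss = criterion(model(img_emb, txt_emb), labels)
loss.backward()
optimizer.step()

# At inference, captions whose predicted acceptability falls below a chosen
# threshold can be filtered before being shown to users.
with torch.no_grad():
    scores = torch.sigmoid(model(img_emb, txt_emb))
    keep = scores > 0.5
```

Training against binary labels mirrors the coarse judgments collected in the dataset; a subset of fine-grained expert scores, as mentioned above, could serve as an evaluation target for the same model without changing this overall setup.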
