A Multi-agent Court to Mitigate VLM Hallucinations

Abstract

Hallucinations have hindered the widespread use of vision language models (VLMs) for domain-specific applications such as road maintenance. While previous researchers constructed multiple solutions for different sources of visual hallucinations, knowledge gaps persist in handling context-dependent hallucinations where the targeted objects are difficult to prompt precisely. This research addresses hallucination by converting image-to-text binary classifications into evidential arguments made by VLM agents, each providing a binary Yes/No answer with a justification. The proposed solution assigns distinct roles to VLM agents, starting with a detection unit in which a primary detector is paired with a reviewer that verifies scope compatibility. These agents interact to aggregate their findings and justifications into a single, unified verdict. The agents' roles are inspired by the distinctive duties of the prosecutor, the defence counsel, and the judge, while the questioning techniques used by the justification reviewer draw on lawyers' courtroom examination techniques and argumentation schemes. Experiments are performed on toppled poles in road scene images from the Urban Issue dataset, and broader applicability is assessed on subsets of the PhD dataset annotated on COCO2014. Results show that our solution achieved a superior overturning rate of 30% and a 2.3-percentage-point increase in F1 score in the domain-specific application, while requiring 50% less time than the closest multi-agent solution. Comparable detection performance and efficient resource consumption were also observed in the general setting.
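To make the court analogy concrete, the sketch below shows one way such a pipeline could be wired together in Python. The `VLMBackend` interface, all prompts, and the function names are illustrative assumptions rather than the paper's actual implementation; a real system would wrap a VLM API call and use the authors' role-specific prompts.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a court-style multi-agent pipeline.
# A backend is any callable mapping (prompt, image_path) -> model reply text.
VLMBackend = Callable[[str, str], str]


@dataclass
class Argument:
    verdict: bool        # binary Yes/No answer
    justification: str   # evidence supporting the verdict


def ask(vlm: VLMBackend, image: str, prompt: str) -> Argument:
    reply = vlm(prompt, image)
    return Argument(reply.strip().lower().startswith("yes"), reply)


def detect(vlm: VLMBackend, image: str, question: str) -> Argument:
    # Primary detector: binary classification with a justification.
    return ask(vlm, image, f"Answer Yes or No, then justify: {question}")


def review_scope(vlm: VLMBackend, image: str, question: str,
                 arg: Argument) -> bool:
    # Reviewer: checks the justification stays within the question's scope.
    prompt = (f"Question: {question}\nJustification: {arg.justification}\n"
              "Does the justification address only this question? Yes or No.")
    return vlm(prompt, image).strip().lower().startswith("yes")


def court_verdict(vlm: VLMBackend, image: str, question: str) -> Argument:
    initial = detect(vlm, image, question)
    if not review_scope(vlm, image, question, initial):
        initial = detect(vlm, image, question)  # re-elicit on scope failure

    # Prosecutor attacks the initial verdict; defence counsel supports it.
    stance = "Yes" if initial.verdict else "No"
    prosecution = ask(vlm, image,
                      f"Argue AGAINST the answer '{stance}' to: {question} "
                      "Answer Yes or No, then justify.")
    defence = ask(vlm, image,
                  f"Argue FOR the answer '{stance}' to: {question} "
                  "Answer Yes or No, then justify.")

    # Judge: aggregates both arguments into a single unified verdict.
    return ask(vlm, image,
               f"Question: {question}\n"
               f"Prosecution: {prosecution.justification}\n"
               f"Defence: {defence.justification}\n"
               "Weigh both sides. Answer Yes or No, then justify.")


if __name__ == "__main__":
    # Dummy backend so the sketch runs end-to-end without a real VLM.
    dummy: VLMBackend = lambda prompt, image: "Yes, the pole leans visibly."
    print(court_verdict(dummy, "road_scene.jpg", "Is there a toppled pole?"))
```

Separating the attacking and defending arguments before a final judging pass mirrors the adversarial structure the abstract describes: the verdict is not a simple majority vote but an aggregation of opposing justifications.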
