Visual Hallucination Reduction: An Input-Level Approach for Multimodal Language Models

Abstract

Purpose: Visual hallucinations in Large Language Models (LLMs), in which outputs conflict with the visual input, undermine trust and reliability, especially in applications demanding high transparency, factual correctness, and security. While most prior research focuses on post-hoc corrections or model-specific fine-tuning, the potential of input-stage interventions remains underexplored. This study investigates whether preprocessing alone can mitigate hallucinations without modifying the model architecture.

Methods: We propose an ensemble-based adaptive preprocessing framework that selects the most suitable image filtering strategy for each question type: noise-reduced (NR), edge-enhanced (EE), or the original image (org). The framework requires no retraining and is model-agnostic. We evaluate the method on the HaloQuest benchmark, which features visually challenging multimodal reasoning tasks, and assess hallucination levels with Natural Language Inference (NLI) scores generated via SelfCheckGPT.

Results: Our approach achieves a 44.3% reduction in hallucination rates compared to baseline methods. Notably, this improvement is accomplished without altering the underlying LLM or vision encoder, demonstrating that adaptive preprocessing alone can improve response fidelity.

Conclusion: These findings show that intelligent input conditioning can significantly enhance the factual grounding of LLM outputs. Adaptive preprocessing emerges as a lightweight, architecture-agnostic solution for hallucination mitigation, supporting the development of more secure, interpretable, and trustworthy AI systems.
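
To make the Methods concrete, the following is a minimal sketch of question-conditioned image preprocessing in Python with OpenCV. The keyword heuristics, filter parameters, file names, and the helper functions noise_reduced, edge_enhanced, and select_variant are illustrative assumptions; the paper's actual ensemble selection rules are not specified in the abstract.

    # Illustrative sketch: route each image to NR, EE, or org based on the question.
    # Routing rules and filter settings below are assumptions, not the paper's configuration.
    import cv2
    import numpy as np

    def noise_reduced(image: np.ndarray) -> np.ndarray:
        # "NR" variant: non-local means denoising with typical default-like parameters.
        return cv2.fastNlMeansDenoisingColored(image, None, 10, 10, 7, 21)

    def edge_enhanced(image: np.ndarray) -> np.ndarray:
        # "EE" variant: simple sharpening kernel to emphasize edges and fine detail.
        kernel = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]], dtype=np.float32)
        return cv2.filter2D(image, -1, kernel)

    def select_variant(question: str, image: np.ndarray) -> np.ndarray:
        # Hypothetical heuristics: fine-detail questions favor edge enhancement,
        # counting/scene questions favor denoising, everything else stays original ("org").
        q = question.lower()
        if any(k in q for k in ("text", "read", "sign", "small", "edge")):
            return edge_enhanced(image)
        if any(k in q for k in ("how many", "count", "scene", "background")):
            return noise_reduced(image)
        return image

    if __name__ == "__main__":
        img = cv2.imread("example.jpg")  # any test image (hypothetical file name)
        conditioned = select_variant("How many people are in the scene?", img)
        cv2.imwrite("conditioned.jpg", conditioned)

Because the selection operates purely on the input image and question, any multimodal model can consume the conditioned image without retraining, which is the architecture-agnostic property the abstract emphasizes.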
