A Unified Multi-Domain Framework for Hallucination Detection and Reliability Evaluation in Large Language Models
Abstract
Large Language Models (LLMs) such as OpenAI's GPT-5, Anthropic's Claude Sonnet, and Google's Gemini are valued for their reasoning, summarization, and cybersecurity-analysis capabilities, among other strengths. However, their reliability can become an issue when they are confronted with ambiguous or hostile user prompts. Such inputs can lead to soft failures, including false citations, hallucinations, and problematic suggestions. Existing benchmarks (e.g., TruthfulQA, AdvGLUE, and JailbreakBench) assess discrete aspects of robustness but do not capture the multi-domain, conversational vulnerabilities that arise in practical applications. This paper introduces MDH-Bench, a novel benchmark of 400 prompts specifically designed to assess how reliably LLMs respond to adversarial yet natural, real-world queries. The benchmark covers eight challenge categories, including numerical reasoning traps, security-sensitive defaults, historical and temporal inconsistencies, fictional research scenarios, contradictory context mixing, ethical provocations, and prompt-injection attempts. To evaluate reliability holistically, we present a Unified Multi-Domain Reliability Evaluation Framework that combines several key metrics: accuracy, hallucination rate, hallucination degree (HD), unsafe output rate, and contradiction rate. A systematic assessment of four high-performance LLMs (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro, and DeepSeek V3.1) was conducted under controlled, manual, multi-annotator settings. The findings indicate that although overall accuracy exceeds 88% for all models, hallucinations persist at both high frequency and high severity. Our investigation shows that accuracy alone is insufficient for evaluating a system's reliability. MDH-Bench and the combined reliability scoring framework provide a broad, extensible foundation for assessing and improving trust in future LLMs deployed in real-world applications.
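The abstract does not specify how the framework's metrics are aggregated; the sketch below is a minimal illustration only, assuming each annotated response carries boolean labels (correct, hallucinated, unsafe, contradictory) and a severity value for hallucination degree, and that a unified reliability score can be formed as a weighted penalty over the per-model failure rates. The field names, weights, and the `reliability_score` aggregation are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class AnnotatedResponse:
    # Hypothetical per-response labels assigned by human annotators.
    correct: bool                 # factually/behaviourally correct answer
    hallucinated: bool            # contains fabricated content
    hallucination_degree: float   # severity in [0, 1]; 0.0 if no hallucination
    unsafe: bool                  # produced an unsafe or harmful output
    contradictory: bool           # contradicts the prompt's context or itself

def per_model_metrics(responses: List[AnnotatedResponse]) -> Dict[str, float]:
    """Compute the five metrics named in the abstract as rates over one model's responses."""
    n = len(responses)
    return {
        "accuracy": sum(r.correct for r in responses) / n,
        "hallucination_rate": sum(r.hallucinated for r in responses) / n,
        "hallucination_degree": sum(r.hallucination_degree for r in responses) / n,
        "unsafe_output_rate": sum(r.unsafe for r in responses) / n,
        "contradiction_rate": sum(r.contradictory for r in responses) / n,
    }

def reliability_score(m: Dict[str, float], weights: Optional[Dict[str, float]] = None) -> float:
    """Hypothetical unified score: accuracy penalised by the failure-mode rates.
    The actual aggregation used by the framework is not given in the abstract."""
    w = weights or {"hallucination_rate": 0.3, "hallucination_degree": 0.2,
                    "unsafe_output_rate": 0.3, "contradiction_rate": 0.2}
    penalty = sum(w[k] * m[k] for k in w)
    return max(0.0, m["accuracy"] - penalty)
```

Under this sketch, a model with 88% accuracy but frequent or severe hallucinations would still receive a markedly lower unified score than its accuracy alone suggests, which is the point the abstract makes about accuracy being insufficient on its own.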