The Scoring Problem in Multi-Model LLM Benchmarks: How Unreported Methodological Choices Change Hallucination Measurement by 3.5×
Abstract
Multi-model LLM benchmarks increasingly rely on LLM-as-judge evaluation to measure hallucination. We identify a critical methodological problem: benchmark results change dramatically based on how three response categories are handled, namely epistemic abstentions, policy refusals, and judge-ambiguous responses. On TruthfulQA (N = 790, 5 models, 3,950 responses), we demonstrate that hallucination rates shift from 8.9% to 31.3% depending solely on the scoring regime, a 3.5× variation. Human evaluation of 100 stratified ambiguous responses by three annotators reveals that 77% of judge-ambiguous verdicts are missed hallucinations that the LLM judge failed to detect, establishing a ground-truth hallucination rate of 26.1%, roughly three times the rate reported under conservative scoring. A second independent judge (Claude) produces different ambiguity rates (13.7% vs. 22.4%), confirming that findings are judge-dependent. We further show that judge ambiguity disproportionately affects open-weight models (34% ambiguous for Llama 70B vs. 12.5% for Claude-Sonnet), creating evaluation-induced bias in benchmark rankings. Each model's reported hallucination rate is therefore not a single number but a range: GPT-4o varies from 7.2% to 27.8% and Llama 70B from 9.8% to 43.8% across six evaluation conditions (2 judges × 3 scoring regimes), and model rankings change 5 times across these 6 conditions. We propose a three-regime scoring framework, recommend dual-judge evaluation with human adjudication, and introduce the concept of epistemic routing as a complementary verification mechanism.
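To make the scoring-regime dependence concrete, the sketch below shows how the same set of per-response judge verdicts can yield very different hallucination rates depending on how abstentions, refusals, and judge-ambiguous responses are counted. This is an illustration only: the regime names, the specific counting rules, and the toy numbers are our assumptions, not the paper's released scoring code or its exact three-regime definitions.

```python
from collections import Counter

def hallucination_rate(verdicts, regime):
    """Compute a hallucination rate under one of three illustrative regimes.

    `verdicts` is a list of per-response labels drawn from:
    "correct", "hallucination", "abstention", "refusal", "ambiguous".
    The regime rules below are assumptions chosen to demonstrate the effect,
    not the paper's exact definitions.
    """
    counts = Counter(verdicts)
    n = len(verdicts)
    if regime == "conservative":
        # Only judge-confirmed hallucinations count; all other categories
        # are treated as non-hallucinations.
        numerator, denominator = counts["hallucination"], n
    elif regime == "exclusionary":
        # Abstentions, refusals, and ambiguous verdicts are dropped
        # from both numerator and denominator.
        numerator = counts["hallucination"]
        denominator = counts["correct"] + counts["hallucination"]
    elif regime == "strict":
        # Judge-ambiguous responses are counted as hallucinations, in the
        # spirit of the finding that most of them were missed hallucinations.
        numerator = counts["hallucination"] + counts["ambiguous"]
        denominator = n
    else:
        raise ValueError(f"unknown regime: {regime}")
    return numerator / denominator if denominator else float("nan")

# Toy usage: 1,000 hypothetical verdicts scored three ways.
verdicts = (["correct"] * 640 + ["hallucination"] * 90 +
            ["abstention"] * 50 + ["refusal"] * 40 + ["ambiguous"] * 180)
for regime in ("conservative", "exclusionary", "strict"):
    print(regime, round(hallucination_rate(verdicts, regime), 3))
# conservative 0.09, exclusionary 0.123, strict 0.27
```

Even with identical judge outputs, the reported rate here triples between the most and least conservative handling rules, which is the kind of spread the abstract reports across its six evaluation conditions.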