The Scoring Problem in Multi-Model LLM Benchmarks: How Unreported Methodological Choices Change Hallucination Measurement by 3.5×

Abstract

Multi-model LLM benchmarks increasingly rely on LLM-as-judge evaluation to measure hallucination. We identify a critical methodological problem: benchmark results change dramatically depending on how three response categories (epistemic abstentions, policy refusals, and judge-ambiguous responses) are handled. On TruthfulQA (N = 790, 5 models, 3,950 responses), we demonstrate that hallucination rates shift from 8.9% to 31.3% depending solely on the scoring regime, a 3.5× variation. Human evaluation of 100 stratified ambiguous responses by three annotators reveals that 77% of judge-ambiguous verdicts are missed hallucinations the LLM judge failed to detect, establishing a ground-truth hallucination rate of 26.1%, roughly three times the rate reported under conservative scoring. A second, independent judge (Claude) produces a different ambiguity rate (13.7% vs. 22.4%), confirming that these findings are judge-dependent. We further show that judge ambiguity disproportionately affects open-weight models (34% ambiguous for Llama 70B vs. 12.5% for Claude-Sonnet), creating evaluation-induced bias in benchmark rankings. Each model's reported hallucination rate is therefore not a single number but a range: GPT-4o spans 7.2%–27.8% and Llama 70B spans 9.8%–43.8% across six evaluation conditions (2 judges × 3 regimes), and model rankings change 5 times across these conditions. We propose a three-regime scoring framework, recommend dual-judge evaluation with human adjudication, and introduce epistemic routing as a complementary verification mechanism.
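
For concreteness, the sketch below shows how different scoring regimes can turn the same set of judged responses into very different hallucination rates. The regime names (conservative, neutral, strict), the category labels, and the way each regime counts abstentions, refusals, and judge-ambiguous verdicts are illustrative assumptions, not the paper's exact definitions, and the example counts are made up.

from collections import Counter

def hallucination_rate(verdicts, regime):
    """Compute a hallucination rate under one of three assumed scoring regimes.

    conservative: abstentions, refusals, and ambiguous verdicts are dropped from
                  the denominator and never count as hallucinations.
    neutral:      ambiguous verdicts stay in the denominator but do not count as
                  hallucinations; abstentions and refusals are dropped.
    strict:       ambiguous verdicts count as hallucinations; abstentions and
                  refusals stay in the denominator as non-hallucinations.
    """
    counts = Counter(verdicts)
    halluc = counts["hallucination"]
    if regime == "conservative":
        denom = counts["hallucination"] + counts["correct"]
    elif regime == "neutral":
        denom = counts["hallucination"] + counts["correct"] + counts["ambiguous"]
    elif regime == "strict":
        halluc += counts["ambiguous"]
        denom = sum(counts.values())
    else:
        raise ValueError(f"unknown regime: {regime}")
    return halluc / denom if denom else 0.0

# Toy example: 1,000 judged responses with invented category counts.
verdicts = (["hallucination"] * 80 + ["correct"] * 700 +
            ["abstention"] * 40 + ["refusal"] * 30 + ["ambiguous"] * 150)
for regime in ("conservative", "neutral", "strict"):
    print(regime, round(hallucination_rate(verdicts, regime), 3))

Under these invented counts the same 1,000 responses yield rates from roughly 9% to 23%, illustrating the abstract's point that the scoring regime, rather than the model, can drive much of the reported variation.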
