The Scoring Problem in Multi-Model LLM Benchmarks: How Unreported Methodological Choices Change Hallucination Measurement by 3.5×
Abstract
Multi-model LLM benchmarks increasingly rely on LLM-as-judge evaluation to measure hallucination. We identify a critical methodological problem: benchmark results change dramatically based on how three response categories are handled, namely epistemic abstentions, policy refusals, and judge-ambiguous responses. On TruthfulQA (N = 790, 5 models, 3,950 responses), we demonstrate that hallucination rates shift from 8.9% to 31.3% depending solely on the scoring regime, a 3.5× variation. Human evaluation of 100 stratified ambiguous responses by three annotators reveals that 77% of judge-ambiguous verdicts are missed hallucinations that the LLM judge failed to detect, establishing a ground-truth hallucination rate of 26.1%, roughly three times the rate reported under conservative scoring. A second independent judge (Claude) produces different ambiguity rates (13.7% vs. 22.4%), confirming that findings are judge-dependent. We further show that judge ambiguity disproportionately affects open-weight models (34% ambiguous for Llama 70B vs. 12.5% for Claude-Sonnet), creating evaluation-induced bias in benchmark rankings. Each model's reported hallucination rate is therefore not a single number but a range: GPT-4o varies from 7.2% to 27.8% and Llama 70B from 9.8% to 43.8% across six evaluation conditions (2 judges × 3 scoring regimes), and model rankings change 5 times across these 6 conditions. We propose a three-regime scoring framework, recommend dual-judge evaluation with human adjudication, and introduce the concept of epistemic routing as a complementary verification mechanism.
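To make the scoring-regime dependence concrete, the sketch below shows how the same set of per-response judge verdicts can yield very different hallucination rates depending on how abstentions, refusals, and judge-ambiguous responses are counted. This is an illustration only: the regime names, the specific counting rules, and the toy numbers are our assumptions, not the paper's released scoring code or its exact three-regime definitions.

```python
from collections import Counter

def hallucination_rate(verdicts, regime):
    """Compute a hallucination rate under one of three illustrative regimes.

    `verdicts` is a list of per-response labels drawn from:
    "correct", "hallucination", "abstention", "refusal", "ambiguous".
    The regime rules below are assumptions chosen to demonstrate the effect,
    not the paper's exact definitions.
    """
    counts = Counter(verdicts)
    n = len(verdicts)
    if regime == "conservative":
        # Only judge-confirmed hallucinations count; all other categories
        # are treated as non-hallucinations.
        numerator, denominator = counts["hallucination"], n
    elif regime == "exclusionary":
        # Abstentions, refusals, and ambiguous verdicts are dropped
        # from both numerator and denominator.
        numerator = counts["hallucination"]
        denominator = counts["correct"] + counts["hallucination"]
    elif regime == "strict":
        # Judge-ambiguous responses are counted as hallucinations, in the
        # spirit of the finding that most of them were missed hallucinations.
        numerator = counts["hallucination"] + counts["ambiguous"]
        denominator = n
    else:
        raise ValueError(f"unknown regime: {regime}")
    return numerator / denominator if denominator else float("nan")

# Toy usage: 1,000 hypothetical verdicts scored three ways.
verdicts = (["correct"] * 640 + ["hallucination"] * 90 +
            ["abstention"] * 50 + ["refusal"] * 40 + ["ambiguous"] * 180)
for regime in ("conservative", "exclusionary", "strict"):
    print(regime, round(hallucination_rate(verdicts, regime), 3))
# conservative 0.09, exclusionary 0.123, strict 0.27
```

Even with identical judge outputs, the reported rate here triples between the most and least conservative handling rules, which is the kind of spread the abstract reports across its six evaluation conditions.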