A Statistical Framework for Consensus-Based Reliability Assessment in Large Language Model Evaluation Applied to Web Accessibility
Abstract
Context. Multi-rater evaluation systems require reliable consensus estimation methods when objective ground truth is unavailable. This challenge is common in domains requiring semantic judgments from multiple, variably reliable evaluators.

Methods. We present a statistical framework for consensus-based reliability assessment in ensemble evaluation systems. The methodology employs median aggregation for robust consensus estimation and introduces consistency metrics (R², variance, Spearman correlation) to quantify individual rater alignment. We formalize the consensus problem mathematically, develop core set selection algorithms under cost constraints, and validate the approach using the Intraclass Correlation Coefficient (ICC2k). Theoretical properties include robustness guarantees (50% breakdown point) and ICC monotonicity for nested reference sets.

Results. Applied to semantic similarity assessment using 17 Large Language Models on ~14,384 samples, the framework achieves ICC2k = 0.977 with a 9-model core and 0.955 with an optimized 3-model core, demonstrating excellent inter-rater reliability. The 3-model configuration reduces computational requirements by 67% while maintaining near-equivalent reliability (ICC decline of only 2.2%). A strong negative correlation (ρ = -0.83) between rater variance and consensus alignment validates the consistency metrics.

Conclusions. The framework achieves excellent inter-rater reliability while enabling significant computational cost reduction. Results validate the robustness of median-based consensus estimation and demonstrate the framework's effectiveness for multi-rater evaluation without ground truth. The methodology generalizes to any ordinal-scale consensus problem, providing a statistically validated approach for scalable annotation tasks.
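The two core ideas in the Methods summary, median aggregation for the consensus and a per-rater alignment metric, can be sketched in a few lines. The rater names, scores, and ordinal scale below are illustrative assumptions, not data from the study; only Spearman alignment is shown, one of the three consistency metrics named in the abstract.

```python
# Minimal sketch (hypothetical data): median-based consensus across raters,
# then each rater's Spearman correlation with that consensus.
from statistics import median

# ratings[rater] = ordinal scores for the same items (scale assumed 1-5)
ratings = {
    "model_a": [5, 4, 2, 5, 1, 3],
    "model_b": [4, 4, 2, 5, 2, 3],
    "model_c": [5, 3, 1, 4, 1, 2],
}

n_items = len(next(iter(ratings.values())))

# Consensus per item: median across raters (robust; 50% breakdown point)
consensus = [median(r[i] for r in ratings.values()) for i in range(n_items)]

def rankdata(xs):
    """Average ranks with tie handling (needed for Spearman on ordinal data)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Consistency metric: each rater's Spearman alignment with the consensus
alignment = {name: spearman(scores, consensus) for name, scores in ratings.items()}
```

In this toy example, `model_a` matches the consensus exactly (alignment 1.0); a low-variance, well-aligned rater like this is the kind of candidate a core set selection procedure would retain.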