A Statistical Framework for Consensus-Based Reliability Assessment in Large Language Model Evaluation Applied to Web Accessibility
Abstract
Context. Multi-rater evaluation systems require reliable consensus estimation methods when objective ground truth is unavailable. This challenge is common in domains requiring semantic judgments from multiple, variably reliable evaluators.

Methods. We present a statistical framework for consensus-based reliability assessment in ensemble evaluation systems. The methodology employs median aggregation for robust consensus estimation and introduces consistency metrics (R², variance, Spearman correlation) to quantify individual rater alignment. We formalize the consensus problem mathematically, develop core set selection algorithms under cost constraints, and validate the approach using the Intraclass Correlation Coefficient (ICC2k). Theoretical properties include robustness guarantees (50% breakdown point) and ICC monotonicity for nested reference sets.

Results. Applied to semantic similarity assessment using 17 Large Language Models on ~14,384 samples, the framework achieves ICC2k = 0.977 with a 9-model core and 0.955 with an optimized 3-model core, demonstrating excellent inter-rater reliability. The 3-model configuration reduces computational requirements by 67% while maintaining near-equivalent reliability (ICC decline of only 2.2%). A strong negative correlation (ρ = -0.83) between rater variance and consensus alignment validates the consistency metrics.

Conclusions. The framework achieves excellent inter-rater reliability while enabling significant computational cost reduction. Results validate the robustness of median-based consensus estimation and demonstrate the framework's effectiveness for multi-rater evaluation without ground truth. The methodology generalizes to any ordinal-scale consensus problem, providing a statistically validated approach for scalable annotation tasks.
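The two core ideas in the Methods summary, median aggregation for the consensus and a per-rater alignment metric, can be sketched in a few lines. The rater names, scores, and ordinal scale below are illustrative assumptions, not data from the study; only Spearman alignment is shown, one of the three consistency metrics named in the abstract.

```python
# Minimal sketch (hypothetical data): median-based consensus across raters,
# then each rater's Spearman correlation with that consensus.
from statistics import median

# ratings[rater] = ordinal scores for the same items (scale assumed 1-5)
ratings = {
    "model_a": [5, 4, 2, 5, 1, 3],
    "model_b": [4, 4, 2, 5, 2, 3],
    "model_c": [5, 3, 1, 4, 1, 2],
}

n_items = len(next(iter(ratings.values())))

# Consensus per item: median across raters (robust; 50% breakdown point)
consensus = [median(r[i] for r in ratings.values()) for i in range(n_items)]

def rankdata(xs):
    """Average ranks with tie handling (needed for Spearman on ordinal data)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Consistency metric: each rater's Spearman alignment with the consensus
alignment = {name: spearman(scores, consensus) for name, scores in ratings.items()}
```

In this toy example, `model_a` matches the consensus exactly (alignment 1.0); a low-variance, well-aligned rater like this is the kind of candidate a core set selection procedure would retain.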