Human Evaluators vs. LLM-as-a-Judge: Toward Scalable, Real-Time Evaluation of GenAI in Global Health

Abstract

Evaluating the outputs of generative AI (GenAI) models in healthcare is a significant bottleneck for the safe and scalable deployment of these tools. Human expert raters remain the gold standard for assessing the accuracy, contextual appropriateness, and empathy of AI-generated responses, but their assessments are costly, inconsistent, and difficult to scale. The concept of “LLM-as-a-judge” systems, i.e., AI models that evaluate other AI outputs, has recently been proposed; however, their reliability in global health contexts remains untested. In this study, we systematically compared five LLM-judges and six expert human clinicians in evaluating both human- and AI-generated responses to real-world questions submitted by Rwandan community health workers seeking clinical decision support. Using an adapted version of the Med-PaLM 2 evaluation framework, evaluators scored responses across 11 criteria. Our results show that even the highest-performing LLM-judge (Claude-4.1-Opus) achieved human-equivalent evaluations on only four of the eleven criteria. Constructing “LLM juries” to balance model-specific biases improved agreement on only one additional criterion. Some models were consistently overcritical (GPT-5) or overly lenient (Gemini-2.5-Pro). Moreover, performance and cost-effectiveness deteriorated substantially when moving from English to Kinyarwanda inputs. Overall, while LLM-judges show potential as scalable and internally consistent evaluators of GenAI outputs in healthcare, their sensitivity to linguistic and cultural context is a critical limitation. These findings underscore the need for further investment in scalable evaluation solutions, and potentially a fundamental rethinking of how we approach the concept of “correctness” in clinical AI assessment, which currently rests on highly inconsistent expert clinician ratings.
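
As a concrete illustration of the “LLM jury” idea and the judge-versus-human comparison described above, the sketch below aggregates per-criterion scores from several LLM judges and measures their gap from a pooled human-expert rating. The judge names, criteria subset, rating scale, and aggregation rules (median for the jury, mean for the experts) are illustrative assumptions, not the study's actual protocol.

```python
# Minimal sketch (not the authors' code) of aggregating LLM-judge scores
# into a "jury" rating and comparing it against pooled human-expert scores.
# All names and numbers below are illustrative assumptions.
from statistics import median, mean

# Subset of the 11 evaluation criteria, for illustration only.
CRITERIA = ["accuracy", "contextual_appropriateness", "empathy"]

# Hypothetical 1-5 ratings for a single response.
llm_scores = {
    "judge_a": {"accuracy": 4, "contextual_appropriateness": 3, "empathy": 5},
    "judge_b": {"accuracy": 5, "contextual_appropriateness": 4, "empathy": 4},
    "judge_c": {"accuracy": 4, "contextual_appropriateness": 4, "empathy": 4},
}
human_scores = {
    "clinician_1": {"accuracy": 4, "contextual_appropriateness": 5, "empathy": 4},
    "clinician_2": {"accuracy": 5, "contextual_appropriateness": 4, "empathy": 3},
}

def jury_score(judge_scores: dict, criterion: str) -> float:
    """Aggregate individual LLM-judge ratings into one jury rating (median)."""
    return median(s[criterion] for s in judge_scores.values())

def human_consensus(expert_scores: dict, criterion: str) -> float:
    """Pool human expert ratings by taking their mean."""
    return mean(s[criterion] for s in expert_scores.values())

for criterion in CRITERIA:
    gap = abs(jury_score(llm_scores, criterion) - human_consensus(human_scores, criterion))
    print(f"{criterion}: |jury - human| = {gap:.2f}")
```

In practice, agreement would be assessed over many responses with a chance-corrected statistic rather than a raw gap per response; the point here is only to make the jury-aggregation concept concrete.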