Human Evaluators vs. LLM-as-a-Judge: Toward Scalable, Real-Time Evaluation of GenAI in Global Health

Abstract

Evaluating the outputs of generative AI (GenAI) models in healthcare is a significant bottleneck for the safe and scalable deployment of these tools. Human expert raters remain the gold standard for assessing the accuracy, contextual appropriateness, and empathy of AI-generated responses, but their assessments are costly, inconsistent, and difficult to scale. The concept of “LLM-as-a-judge” systems, i.e., AI models that evaluate other AI outputs, has recently been proposed; however, their reliability in global health contexts remains untested. In this study, we systematically compared five LLM-judges and six expert human clinicians in evaluating both human- and AI-generated responses to real-world questions submitted by Rwandan community health workers seeking clinical decision support. Using an adapted version of the Med-PaLM 2 evaluation framework, evaluators scored responses across 11 criteria. Our results show that even the highest-performing LLM-judge (Claude-4.1-Opus) achieved human-equivalent evaluations on only four of the eleven criteria. Constructing “LLM juries” to balance model-specific biases improved agreement on only one additional criterion. Some models were consistently overcritical (GPT-5) or overly lenient (Gemini-2.5-Pro). Moreover, performance and cost-effectiveness deteriorated substantially when moving from English to Kinyarwanda inputs. Overall, while LLM-judges show potential as scalable and internally consistent evaluators of GenAI outputs in healthcare, their sensitivity to linguistic and cultural context is a critical limitation. These findings underscore the need for further investment in scalable evaluation solutions, and potentially a fundamental rethinking of how we approach the concept of “correctness” in clinical AI assessment, which currently rests on highly inconsistent expert clinician ratings.
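
As a concrete illustration of the “LLM jury” idea and the judge-versus-human comparison described above, the sketch below aggregates per-criterion scores from several LLM judges and measures their gap from a pooled human-expert rating. The judge names, criteria subset, rating scale, and aggregation rules (median for the jury, mean for the experts) are illustrative assumptions, not the study's actual protocol.

```python
# Minimal sketch (not the authors' code) of aggregating LLM-judge scores
# into a "jury" rating and comparing it against pooled human-expert scores.
# All names and numbers below are illustrative assumptions.
from statistics import median, mean

# Subset of the 11 evaluation criteria, for illustration only.
CRITERIA = ["accuracy", "contextual_appropriateness", "empathy"]

# Hypothetical 1-5 ratings for a single response.
llm_scores = {
    "judge_a": {"accuracy": 4, "contextual_appropriateness": 3, "empathy": 5},
    "judge_b": {"accuracy": 5, "contextual_appropriateness": 4, "empathy": 4},
    "judge_c": {"accuracy": 4, "contextual_appropriateness": 4, "empathy": 4},
}
human_scores = {
    "clinician_1": {"accuracy": 4, "contextual_appropriateness": 5, "empathy": 4},
    "clinician_2": {"accuracy": 5, "contextual_appropriateness": 4, "empathy": 3},
}

def jury_score(judge_scores: dict, criterion: str) -> float:
    """Aggregate individual LLM-judge ratings into one jury rating (median)."""
    return median(s[criterion] for s in judge_scores.values())

def human_consensus(expert_scores: dict, criterion: str) -> float:
    """Pool human expert ratings by taking their mean."""
    return mean(s[criterion] for s in expert_scores.values())

for criterion in CRITERIA:
    gap = abs(jury_score(llm_scores, criterion) - human_consensus(human_scores, criterion))
    print(f"{criterion}: |jury - human| = {gap:.2f}")
```

In practice, agreement would be assessed over many responses with a chance-corrected statistic rather than a raw gap per response; the point here is only to make the jury-aggregation concept concrete.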