Benchmarking Large Language Models and Clinicians Using Locally Generated Primary Healthcare Vignettes in Kenya

Abstract

Background: Large language models (LLMs) show promise on healthcare tasks, yet most evaluations emphasize multiple-choice accuracy rather than open-ended reasoning. Evidence from low-resource settings remains limited.

Methods: We benchmarked five LLMs (GPT-4.1, Gemini-2.5-Flash, DeepSeek-R1, MedGemma, and o3) against Kenyan clinicians, using a randomly subsampled dataset of 507 vignettes (from a larger pool of 5,107 clinical scenarios) spanning 12 nursing competency categories. Blinded physician panels rated responses on a 5-point Likert scale across an 11-domain rubric covering accuracy, safety, contextual appropriateness, and communication. We summarized mean scores and used Bayesian ordinal logistic regression to estimate the probability of a high-quality rating (≥4) and to perform pairwise comparisons between LLMs and clinicians.

Findings: Clinician mean ratings were lower than those of LLMs in 9 of 11 domains: 2.86 vs 4.25-4.72 (guideline alignment), 2.76 vs 4.25-4.73 (expert knowledge), 2.96 vs 4.30-4.73 (logical coherence), and 2.58 vs 4.16-4.68 (low omission of critical information). On safety-related domains, LLMs received higher ratings: minimal extent of possible harm, 3.16 vs 4.29-4.68; low likelihood of harm, 3.68 vs 4.54-4.81. Performance was similar for low inclusion of irrelevant content (4.28 vs 4.25-4.35) and for avoidance of demographic bias (4.86 vs 4.91-4.94). In Bayesian models, LLMs had a >90% probability of ratings ≥4 in most domains, whereas clinicians exceeded 90% only for contextual relevance and demographic/socio-economic bias. Pairwise contrasts showed broadly overlapping credible intervals among LLMs, with o3 leading numerically in most domains except contextual relevance, demographic/socio-economic bias, and relevance to the question. Generating all LLM responses cost USD 3.86-8.68 per model (USD 0.008-0.017 per vignette), compared with USD 3.35 per clinician-generated vignette.

Interpretation: In vignette-based tasks, LLMs produced responses that were rated more accurate, safer, and more structured than those of clinicians. These findings support further evaluation of LLMs as decision support in resource-constrained health systems.

Funding Statement: This study was supported by the Gates Foundation [INV-068056].
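To make the rating analysis concrete, a cumulative-logit (proportional-odds) form is the usual way to model a 5-point ordinal rating against a rater-group predictor. The abstract does not give the paper's exact specification (priors, vignette- or rater-level effects, per-domain structure), so the following is only a minimal sketch under that assumed form, with rater group $g$ (LLM or clinician) as the sole predictor:

\[
\Pr(Y_i \le k) = \operatorname{logit}^{-1}\!\left(\tau_k - \beta_{g[i]}\right), \qquad k = 1,\dots,4, \qquad \tau_1 < \tau_2 < \tau_3 < \tau_4 ,
\]
\[
\Pr(Y_i \ge 4) = 1 - \Pr(Y_i \le 3) = \operatorname{logit}^{-1}\!\left(\beta_{g[i]} - \tau_3\right).
\]

Here $Y_i$ is a domain rating, the $\tau_k$ are ordered cutpoints, and $\beta_g$ is the rater-group effect; a pairwise comparison between two groups corresponds to the posterior distribution of $\beta_{g_1} - \beta_{g_2}$, with overlapping credible intervals indicating no clear separation.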
