Benchmarking Large Language Models and Clinicians Using Locally Generated Primary Healthcare Vignettes in Kenya

Abstract

Background

Large language models (LLMs) show promise on healthcare tasks, yet most evaluations emphasize multiple-choice accuracy rather than open-ended reasoning. Evidence from low-resource settings remains limited.

Methods

We benchmarked five LLMs (GPT-4.1, Gemini-2.5-Flash, DeepSeek-R1, MedGemma, and o3) against Kenyan clinicians, using a randomly subsampled dataset of 507 vignettes (from a larger pool of 5,107 clinical scenarios) spanning 12 nursing competency categories. Blinded physician panels rated responses on a 5-point Likert scale across an 11-domain rubric covering accuracy, safety, contextual appropriateness, and communication. We summarized mean scores and used Bayesian ordinal logistic regression to estimate probabilities of high-quality ratings (≥4) and to perform pairwise comparisons between LLMs and clinicians.
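The abstract does not specify the exact model parameterization. As a minimal sketch of one common formulation, the snippet below fits a Bayesian cumulative-logit (proportional-odds) model with PyMC on synthetic stand-in data; the group coding, priors, and sampler settings are illustrative assumptions, not the authors' specification.

```python
import numpy as np
import pymc as pm

# Synthetic stand-in data: 1-5 Likert ratings (coded 0-4) for six responder
# groups (five LLMs plus clinicians). The study's real data and priors are
# not reproduced here, so everything below is illustrative.
rng = np.random.default_rng(42)
n = 600
group = rng.integers(0, 6, size=n)                   # 0-4: LLMs, 5: clinicians
true_quality = np.array([1.5, 1.3, 1.4, 1.2, 1.7, 0.0])
latent = true_quality[group] + rng.logistic(size=n)
cuts_true = np.array([-1.0, 0.0, 1.0, 2.0])
ratings = (latent[:, None] > cuts_true).sum(axis=1)  # values in 0..4

with pm.Model() as model:
    # Four ordered cutpoints partition the latent scale into five rating levels.
    cutpoints = pm.Normal(
        "cutpoints", mu=0.0, sigma=2.0, shape=4,
        transform=pm.distributions.transforms.ordered,
        initval=np.linspace(-1.0, 1.0, 4),
    )
    # One latent-quality effect per responder group.
    beta = pm.Normal("beta", mu=0.0, sigma=1.5, shape=6)
    pm.OrderedLogistic("y", eta=beta[group], cutpoints=cutpoints,
                       observed=ratings)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=42)

# P(rating >= 4) = P(coded y >= 3) = 1 - logistic(cutpoints[2] - beta[g]).
c3 = idata.posterior["cutpoints"].values[..., 2]     # (chain, draw)
beta_draws = idata.posterior["beta"].values          # (chain, draw, 6)
p_ge4 = 1.0 / (1.0 + np.exp(c3[..., None] - beta_draws))
print("Posterior mean P(rating >= 4) per group:", p_ge4.mean(axis=(0, 1)))

# Pairwise contrast, e.g. one LLM (group 4) vs clinicians (group 5):
diff = beta_draws[..., 4] - beta_draws[..., 5]
print("P(group 4 rated higher than clinicians):", (diff > 0).mean())
```

Under this formulation, the reported "probability of ratings ≥4" is the posterior of one minus the cumulative logistic at the third cutpoint, and pairwise comparisons reduce to posterior contrasts between group effects.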

Findings

Clinician mean ratings were lower than those of LLMs in 9/11 domains: 2.86 vs 4.25–4.72 (guideline alignment), 2.76 vs 4.25–4.73 (expert knowledge), 2.96 vs 4.30–4.73 (logical coherence), and 2.58 vs 4.16–4.68 (low omission of critical information). On safety-related domains, LLMs received higher ratings: minimal extent of possible harm, 3.16 vs 4.29–4.68; low likelihood of harm, 3.68 vs 4.54–4.81. Performance was similar for low inclusion of irrelevant content (4.28 vs 4.25–4.35) and for avoidance of demographic bias (4.86 vs 4.91–4.94). In Bayesian models, LLMs had >90% probability of ratings ≥4 in most domains, whereas clinicians exceeded 90% only for contextual relevance and demographic/socio-economic bias. Pairwise contrasts showed broadly overlapping credible intervals among LLMs, with o3 leading numerically in most domains except contextual relevance, demographic/socio-economic bias, and relevance to the question. Generating all LLM responses cost USD 3.86–8.68 per model (USD 0.008–0.017 per vignette), compared with USD 3.35 per clinician-generated vignette.
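As a rough consistency check (assuming the quoted per-model totals cover all 507 vignettes): USD 3.86 / 507 ≈ USD 0.008 and USD 8.68 / 507 ≈ USD 0.017 per vignette, which is roughly two orders of magnitude below the USD 3.35 cost per clinician-generated vignette.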

Interpretation

LLMs produced responses that were rated as more accurate, safer, and more structured than those of clinicians in vignette-based tasks. These findings support further evaluation of LLMs as decision-support tools in resource-constrained health systems.

Funding Statement

This study was supported by the Gates Foundation [INV-068056].

Research in Context

Evidence before this study

We searched PubMed, medRxiv, and arXiv (Jan 1, 2021–Sept 30, 2025) using combinations of terms including “large language model”, “LLM”, “healthcare”, “benchmarking”, “clinical decision support”, and “low-resource settings”. The search returned 28 preprints and only 4 peer-reviewed articles. A study from Rwanda benchmarked five LLMs against clinicians using 524 real-world questions from community health workers; all models outperformed clinicians, including in Kinyarwanda (Rutunda, 2025). In Kenya, a multimodal LLM (POE) outperformed primary care providers on 63 otolaryngology cases (79.4% vs 50.8%) and aligned with specialist recommendations (Lechien, 2025). A cross-country maternal health study evaluated GPT-4, GPT-3.5, a custom GPT-3.5, and Meditron-70B on three questions, with expert reviewers in Brazil, Pakistan, and the USA rating outputs in their native languages. GPT-4 and GPT-3.5 were most accurate, though concerns about readability and gender bias were noted (Lima, 2025). AraSum, a lightweight Arabic summarization model, outperformed the Arabic foundation model JAIS-30B on BLEU, ROUGE, and expert ratings of accuracy, comprehensiveness, and clinical utility (Lee, 2025). Additional preprints proposed expert-rated benchmarks for clinical tasks in low- and middle-income countries (LMICs).

Added value of this study

This study uniquely combines local co-design, real-world clinical scenarios, and structured, expert-based assessment across 11 dimensions of clinical quality. It characterizes the strengths and weaknesses of five widely available LLMs relative to frontline clinician performance, offering evidence of systematic clinician gaps in accuracy, guideline adherence, and completeness.

Implications of all the available evidence

LLMs show substantial promise as clinical decision support tools in low-resource health systems. Across multiple settings and task types, current models consistently meet or exceed clinician performance in controlled evaluations. However, real-world deployment requires attention to equity, local clinical validation, and thoughtful implementation pathways that mitigate risk and reinforce trust.
