How Good Are Large Language Models at Supporting Frontline Healthcare Workers in Low-Resource Settings – A Benchmarking Study & Dataset

Abstract

Large language models (LLMs) have demonstrated strong performance in medical contexts; however, existing benchmarks often fail to accurately reflect the real-world complexity of low-resource health systems. This study developed a dataset of 5,609 clinical questions contributed by 101 community health workers (CHWs) across four Rwandan districts and compared responses generated by five LLMs (Gemini-2, GPT-4o, o3-mini, DeepSeek R1, and Meditron-70B) with those from local clinicians. A subset of 524 question-answer pairs was evaluated using a rubric of 11 expert-rated metrics, each scored on a five-point Likert scale. Gemini-2 and GPT-4o were the best performers, achieving mean scores of 4.49 and 4.48 out of 5, respectively, across all 11 metrics. All LLMs significantly outperformed local clinicians on every metric (all p < 0.001); Gemini-2, for example, surpassed local general practitioners (GPs) by an average of 0.83 points per metric (range: 0.38–1.10). While performance degraded slightly when the LLMs communicated in Kinyarwanda, they remained superior to clinicians and were over 500 times cheaper per response. These findings support the potential of LLMs to strengthen frontline care quality in low-resource, multilingual health systems.
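To make the reported comparison concrete, the sketch below shows one plausible way to aggregate the rubric scores and test the per-metric LLM-versus-clinician gap. The abstract does not specify the statistical procedure, so the paired Wilcoxon signed-rank test, the array layout, and all variable names (`llm_scores`, `clinician_scores`, `METRICS`) are illustrative assumptions, not the study's actual pipeline; the data below is randomly generated stand-in data.

```python
# Minimal sketch of the rubric-score comparison described in the abstract.
# Assumptions (not from the paper): ratings are stored as (n_pairs, n_metrics)
# arrays of 1-5 Likert scores, one row per question-answer pair, and a paired
# Wilcoxon signed-rank test is used to compare raters on each metric.
import numpy as np
from scipy.stats import wilcoxon

METRICS = [f"metric_{i + 1}" for i in range(11)]  # 11 expert-rated rubric metrics

rng = np.random.default_rng(0)
# Hypothetical stand-in ratings for the 524 evaluated question-answer pairs.
llm_scores = rng.integers(3, 6, size=(524, 11)).astype(float)        # e.g. Gemini-2
clinician_scores = rng.integers(2, 6, size=(524, 11)).astype(float)  # local clinicians

for m, name in enumerate(METRICS):
    diff = llm_scores[:, m] - clinician_scores[:, m]
    # Paired test on the same question-answer pairs for this metric.
    stat, p = wilcoxon(llm_scores[:, m], clinician_scores[:, m])
    print(f"{name}: LLM mean={llm_scores[:, m].mean():.2f}, "
          f"clinician mean={clinician_scores[:, m].mean():.2f}, "
          f"mean diff={diff.mean():.2f}, p={p:.2g}")

# Overall mean across all 11 metrics (the abstract reports 4.49 for Gemini-2).
print(f"overall LLM mean: {llm_scores.mean():.2f}")
```

With real ratings in place of the random stand-ins, the per-metric mean differences correspond to the 0.38–1.10 range quoted above, and the overall mean corresponds to the reported 4.49/4.48 scores.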
