Probing the Surgical Competence of LLMs: A global health study leveraging the AfriMed-QA benchmark
Abstract
Global surgical care faces a severe workforce shortage, with more than 1.2 million additional specialists needed by 2030, particularly in low- and middle-income countries (LMICs). Large language models (LLMs) have demonstrated impressive medical reasoning on standardized exams, but their safety, reliability, and specialty-specific performance, especially in procedural fields such as surgery, remain uncertain. Here we evaluate over 40 state-of-the-art LLMs on 3,900 expert-authored multiple-choice questions across 32 medical specialties from the AfriMed-QA benchmark, developed by 20 African medical professors. Top models (o1, GPT-4o, Claude 3.5) achieved mean accuracies exceeding 82%, showing strong diagnostic reasoning, yet consistently underperformed in surgery, pathology, and obstetrics relative to non-procedural medical disciplines. Error analyses revealed frequent procedural reasoning failures, omission of local clinical guidelines, and overconfident but incorrect answers. Smaller models and biomedicine-specialized models exhibited higher rates of hallucination and formatting errors, while prompting strategies yielded inconsistent benefits. These results highlight the uneven readiness of LLMs for specialty-specific decision support and underscore the need for locally grounded evaluation frameworks, improved instruction tuning, and rigorous real-world validation to ensure the safe and equitable deployment of AI-assisted clinical tools in LMICs.
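To make the evaluation protocol concrete, the sketch below shows one way to score a model on expert-authored multiple-choice questions and report per-specialty accuracy alongside a formatting-error rate, two of the quantities the abstract reports. It is a minimal illustration, not the authors' actual harness: the record layout (`specialty`, `prompt`, `answer`), the `query_model` callable, and the `parse_choice` helper are all assumptions introduced here, and real AfriMed-QA records may differ.

```python
# Minimal sketch of a per-specialty MCQ evaluation loop.
# Assumptions (not from the paper): question dicts carry 'specialty',
# 'prompt', and 'answer' (a gold option letter); query_model() wraps
# whatever LLM API is under test and returns its raw text reply.
import re
from collections import defaultdict


def parse_choice(response: str) -> str | None:
    """Extract a single option letter (A-E) from a model reply;
    return None when the reply is unparseable (a formatting error)."""
    match = re.search(r"\b([A-E])\b", response.strip().upper())
    return match.group(1) if match else None


def evaluate(questions, query_model):
    """Score one model on MCQs, grouped by medical specialty."""
    correct = defaultdict(int)
    total = defaultdict(int)
    format_errors = defaultdict(int)
    for q in questions:
        spec = q["specialty"]
        total[spec] += 1
        choice = parse_choice(query_model(q["prompt"]))
        if choice is None:
            format_errors[spec] += 1  # reply did not name a valid option
        elif choice == q["answer"]:
            correct[spec] += 1
    return {
        spec: {
            "accuracy": correct[spec] / total[spec],
            "format_error_rate": format_errors[spec] / total[spec],
        }
        for spec in total
    }
```

Tracking the formatting-error rate separately from accuracy matters here because, per the findings above, smaller and biomedicine-specialized models fail disproportionately by producing unparseable answers rather than by choosing the wrong option.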