Probing the Surgical Competence of LLMs: A global health study leveraging the AfriMed-QA benchmark
Abstract
Global surgical care faces a severe workforce shortage, with more than 1.2 million additional specialists needed by 2030, particularly in low- and middle-income countries (LMICs). Large language models (LLMs) have demonstrated impressive medical reasoning on standardized exams, but their safety, reliability, and specialty-specific performance, especially in procedural fields such as surgery, remain uncertain. Here we evaluate over 40 state-of-the-art LLMs on 3,900 expert-authored multiple-choice questions across 32 medical specialties from the AfriMed-QA benchmark, developed by 20 African medical professors. Top models (o1, GPT-4o, Claude 3.5) achieved mean accuracies exceeding 82%, showing strong diagnostic reasoning, yet consistently underperformed in surgery, pathology, and obstetrics relative to non-procedural medical disciplines. Error analyses revealed frequent procedural reasoning failures, omission of local clinical guidelines, and overconfident but incorrect answers. Smaller models and biomedicine-specialized models exhibited higher rates of hallucination and formatting errors, while prompting strategies yielded inconsistent benefits. These results highlight the uneven readiness of LLMs for specialty-specific decision support and underscore the need for locally grounded evaluation frameworks, improved instruction tuning, and rigorous real-world validation to ensure the safe and equitable deployment of AI-assisted clinical tools in LMICs.
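To make the evaluation protocol concrete, the sketch below shows one way to score a model on expert-authored multiple-choice questions and report per-specialty accuracy alongside a formatting-error rate, two of the quantities the abstract reports. It is a minimal illustration, not the authors' actual harness: the record layout (`specialty`, `prompt`, `answer`), the `query_model` callable, and the `parse_choice` helper are all assumptions introduced here, and real AfriMed-QA records may differ.

```python
# Minimal sketch of a per-specialty MCQ evaluation loop.
# Assumptions (not from the paper): question dicts carry 'specialty',
# 'prompt', and 'answer' (a gold option letter); query_model() wraps
# whatever LLM API is under test and returns its raw text reply.
import re
from collections import defaultdict


def parse_choice(response: str) -> str | None:
    """Extract a single option letter (A-E) from a model reply;
    return None when the reply is unparseable (a formatting error)."""
    match = re.search(r"\b([A-E])\b", response.strip().upper())
    return match.group(1) if match else None


def evaluate(questions, query_model):
    """Score one model on MCQs, grouped by medical specialty."""
    correct = defaultdict(int)
    total = defaultdict(int)
    format_errors = defaultdict(int)
    for q in questions:
        spec = q["specialty"]
        total[spec] += 1
        choice = parse_choice(query_model(q["prompt"]))
        if choice is None:
            format_errors[spec] += 1  # reply did not name a valid option
        elif choice == q["answer"]:
            correct[spec] += 1
    return {
        spec: {
            "accuracy": correct[spec] / total[spec],
            "format_error_rate": format_errors[spec] / total[spec],
        }
        for spec in total
    }
```

Tracking the formatting-error rate separately from accuracy matters here because, per the findings above, smaller and biomedicine-specialized models fail disproportionately by producing unparseable answers rather than by choosing the wrong option.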