Toward Accurate and Actionable Differential Diagnosis with Lean LLM Orchestration
Abstract
Large language models (LLMs) can assist clinicians with diagnostic reasoning, yet their autonomous diagnostic performance remains uncertain. We evaluated OpenMedicine AI, an LLM-powered diagnostic agent with a deterministic controller, on 302 New England Journal of Medicine Clinicopathological Conference (CPC) cases, a benchmark renowned for diagnostic difficulty. Models produced ranked differential-diagnosis lists. Accuracy was assessed in two ways: by inclusion of the ground-truth diagnosis within the Top-n list (Top-n accuracy), and by Capture@K, an actionability metric under which a case counts as “captured” if any of the top K differentials would appropriately lead a clinician to order the diagnostic test of record (DToR) or its immediate precursor.
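As a rough illustration of the Top-n accuracy definition above, the following minimal sketch scores ranked differential lists against a ground-truth label. The function names, the toy cases, and the use of case-insensitive exact string matching are all assumptions made for illustration; the abstract does not state how matches were adjudicated, and Capture@K is not implemented here because it depends on a clinical judgment about which test a differential would prompt.

```python
from typing import Sequence, Tuple


def top_n_hit(ranked: Sequence[str], truth: str, n: int) -> bool:
    """Return True if the ground-truth diagnosis is among the first n entries.

    Assumption: matching is case-insensitive exact string equality, which is
    a stand-in for whatever adjudication a real evaluation would use.
    """
    def norm(s: str) -> str:
        return s.strip().lower()

    return norm(truth) in {norm(d) for d in ranked[:n]}


def top_n_accuracy(cases: Sequence[Tuple[Sequence[str], str]], n: int) -> float:
    """Fraction of cases whose ground truth appears in the Top-n list."""
    hits = sum(top_n_hit(ranked, truth, n) for ranked, truth in cases)
    return hits / len(cases)


# Hypothetical toy cases (ranked differential, ground truth); not study data.
toy_cases = [
    (["sarcoidosis", "tuberculosis", "lymphoma"], "Tuberculosis"),
    (["lupus", "vasculitis"], "endocarditis"),
]

top1 = top_n_accuracy(toy_cases, 1)
top3 = top_n_accuracy(toy_cases, 3)
```

On these toy cases the first ground truth appears at rank 2 and the second never appears, so Top-1 accuracy is 0.0 and Top-3 accuracy is 0.5.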
Across 302 CPCs, OpenMedicine AI achieved 46.0% Top-1 and 79.1% Top-10 accuracy, outperforming AMIE (32.5%, 68.9%) and physicians (15.6%, 20.9%). Paired McNemar tests confirmed superiority at all thresholds (p < 10⁻⁵). For actionability, at Capture@10 it matched or exceeded AMIE in 97.0% of cases and physicians in 96.7%. Across the 302 cases, it rescued 99 that physicians missed (odds ratio [OR] 16.5) and 44 that AMIE missed (OR 7.3), reducing misses by 31 and 13 per 100 cases, respectively. These gains correspond to a number needed to assess (NNA) of 3.21 versus physicians and 7.95 versus AMIE. A safety margin was already evident at Capture@3, with rescues outnumbering failures to rescue versus physicians (109 vs 15; OR 7.27; 95% CI, 4.24 to 12.47; p = 8.7×10⁻¹⁹) and versus AMIE (61 vs 15; OR 4.07; 95% CI, 2.31 to 7.15; p = 9.84×10⁻⁸), corresponding to 31 and 15 fewer misses per 100 cases, respectively.
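The Capture@3 figures above are consistent with a standard matched-pair analysis of discordant cases, sketched below. This is an assumption about the method, not a statement of the authors' pipeline: the odds ratio is taken as rescues divided by failures to rescue, the 95% CI is a Wald interval on the log odds ratio, and the number needed to assess is taken as the reciprocal of the net reduction in misses (by analogy with number needed to treat). The function and field names are hypothetical.

```python
import math
from typing import NamedTuple


class PairedSummary(NamedTuple):
    odds_ratio: float
    ci_low: float
    ci_high: float
    fewer_misses_per_100: float
    nna: float


def paired_summary(rescues: int, failures: int, n_cases: int,
                   z: float = 1.96) -> PairedSummary:
    """Summarize discordant pairs from a paired comparison.

    rescues:  cases captured by the system but missed by the comparator
    failures: cases captured by the comparator but missed by the system
    Assumes both counts are nonzero and rescues > failures.
    """
    odds_ratio = rescues / failures
    # Wald standard error of log(OR) for matched-pair discordant counts.
    se = math.sqrt(1 / rescues + 1 / failures)
    ci_low = math.exp(math.log(odds_ratio) - z * se)
    ci_high = math.exp(math.log(odds_ratio) + z * se)
    net = (rescues - failures) / n_cases  # net absolute reduction in misses
    return PairedSummary(odds_ratio, ci_low, ci_high, 100 * net, 1 / net)


# Capture@3 discordant counts as reported in the abstract.
vs_physicians = paired_summary(109, 15, 302)
vs_amie = paired_summary(61, 15, 302)
```

With the reported counts, this sketch reproduces the stated odds ratios and intervals (7.27, 4.24 to 12.47 versus physicians; 4.07, 2.31 to 7.15 versus AMIE) and the roughly 31 and 15 fewer misses per 100 cases. The p-values are not recomputed here.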
These findings indicate that a lightweight, deterministic controller layered over state-of-the-art LLMs can narrow the gap between diagnostic recall and clinical actionability. By producing high-quality differentials and prioritizing rational next tests, this approach offers a scalable, resource-efficient path to improved diagnostic performance in high-complexity clinical scenarios.