Large language models for automatable real-world performance monitoring of diagnostic decision support systems: a comparison to manual doctor panel review in a prospective clinical study
Abstract
Background
Diagnostic decision support systems (DDSS) are increasingly deployed at scale, yet their diagnostic accuracy is insufficiently monitored once integrated into care. Traditional post-market surveillance relies on clinician review, which is costly, slow, and difficult to sustain. Large language models (LLMs) may offer a scalable and potentially automatable solution, but their performance in real-world monitoring remains unknown.
Methods
We conducted a diagnostic accuracy substudy within ESSENCE, a prospective evaluation of Ada Health’s DDSS integrated into Portugal’s largest private healthcare network. Clinical notes and ICD-10 diagnoses from 498 encounters were anonymised and classified using a filter–map–match framework. Manual clinician review served as the reference standard. We compared eligibility classification and condition mapping between clinicians and GPT-4.1 and GPT-5, and assessed diagnostic accuracy of two DDSS versions using both reference sets.
Findings
Manual review classified 385 of 498 encounters (77·3%) as eligible for diagnostic comparison. GPT-5 reproduced these classifications with 84·7% accuracy (κ=0·57), showing high sensitivity but only moderate specificity. Among 347 encounters judged eligible by both approaches, GPT-5 exactly matched clinician-assigned diagnoses in 93·6% and proposed clinically plausible alternatives in 3·5%. Diagnostic accuracy estimates based on manual versus GPT-5 mappings were statistically indistinguishable at Top-1 and Top-3 across the full analysable sets, with one significant difference at Top-5. In the 346 overlapping cases, no statistically significant differences were observed. Across both reference sets, the experimental DDSS version outperformed the original only at the Top-5 threshold.
Interpretation
LLMs can reproduce clinician review of real-world diagnostic encounters with close agreement. While GPT-5 performed comparably to clinicians for condition mapping, the eligibility filtering step (deciding which encounters should enter the diagnostic-accuracy analysis) remains the main source of divergence and the priority for improvement. Embedding such approaches into health systems could enable automated, continuous performance and safety monitoring and support regulatory compliance. Broader evaluations across diverse care settings are needed to establish generalisability and equity impact.
Funding
German Federal Ministry of Education and Research (NextGenerationEU, PATH project).