Large language models for automatable real-world performance monitoring of diagnostic decision support systems: a comparison to manual doctor panel review in a prospective clinical study

Abstract

Background

Diagnostic decision support systems (DDSS) are increasingly deployed at scale, yet their diagnostic accuracy is insufficiently monitored once integrated into care. Traditional post-market surveillance relies on clinician review, which is costly, slow, and difficult to sustain. Large language models (LLMs) may offer a scalable and potentially automatable solution, but their performance in real-world monitoring remains unknown.

Methods

We conducted a diagnostic accuracy substudy within ESSENCE, a prospective evaluation of Ada Health's DDSS integrated into Portugal's largest private healthcare network. Clinical notes and ICD-10 diagnoses from 498 encounters were anonymised and classified using a filter–map–match framework. Manual clinician review served as the reference standard. We compared eligibility classification and condition mapping between clinicians and two LLMs, GPT-4.1 and GPT-5, and assessed the diagnostic accuracy of two DDSS versions against both reference sets.
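The abstract does not specify how the filter–map–match framework is implemented, so the sketch below is purely illustrative: it shows one plausible way the three stages could be wired together for top-k accuracy monitoring. All names, the eligibility rule, and the matching rule are assumptions for exposition, not the study's actual pipeline or prompts.

```python
# Illustrative sketch of a filter-map-match pipeline for DDSS accuracy
# monitoring. Every rule below is a stand-in assumption; the study's real
# eligibility criteria, condition mapping, and matching logic are not
# described in the abstract.
from dataclasses import dataclass


@dataclass
class Encounter:
    note: str                    # anonymised clinical note
    icd10_code: str              # clinician-assigned ICD-10 diagnosis
    ddss_suggestions: list[str]  # DDSS condition list, ranked best-first


def filter_eligible(enc: Encounter) -> bool:
    """FILTER: decide whether the encounter can enter the diagnostic
    comparison (e.g. a diagnostic visit with a usable note). In the study
    this step is performed by clinicians or an LLM; this is a trivial stub."""
    return bool(enc.note.strip()) and bool(enc.icd10_code)


def map_to_condition(enc: Encounter) -> str:
    """MAP: translate the ICD-10 code / note text into the DDSS's own
    condition vocabulary. A real pipeline would prompt an LLM or query a
    terminology service; this stub passes the code through unchanged."""
    return enc.icd10_code


def match_top_k(enc: Encounter, condition: str, k: int) -> bool:
    """MATCH: does the mapped condition appear among the DDSS's top-k
    ranked suggestions?"""
    return condition in enc.ddss_suggestions[:k]


def top_k_accuracy(encounters: list[Encounter], k: int) -> float:
    """Aggregate Top-k accuracy over the eligible encounters."""
    eligible = [e for e in encounters if filter_eligible(e)]
    hits = sum(match_top_k(e, map_to_condition(e), k) for e in eligible)
    return hits / len(eligible) if eligible else 0.0
```

Under this reading, the Top-1, Top-3, and Top-5 figures in the Findings correspond to top_k_accuracy with k = 1, 3, and 5, computed with the study's actual filter and map stages rather than these stand-ins.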

Findings

Manual review classified 385 of 498 encounters (77·3%) as eligible for diagnostic comparison. GPT-5 reproduced these classifications with 84·7% accuracy (κ=0·57), showing high sensitivity but only moderate specificity. Among 347 encounters judged eligible by both approaches, GPT-5 exactly matched clinician-assigned diagnoses in 93·6% and proposed clinically plausible alternatives in a further 3·5%. Diagnostic accuracy estimates based on manual versus GPT-5 mappings were statistically indistinguishable at Top-1 and Top-3 across the full analysable sets, with one significant difference at Top-5. In the 346 overlapping cases, no statistically significant differences were observed. Across both reference sets, the experimental DDSS version outperformed the original only at the Top-5 threshold.
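For orientation (this derivation is not in the article): Cohen's κ discounts the agreement expected by chance, p_e, from the observed agreement p_o, which is why κ sits well below the raw 84·7% when one class (here, eligible encounters, 77·3% under manual review) dominates:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
p_e = \frac{p_o - \kappa}{1 - \kappa} \approx \frac{0.847 - 0.57}{1 - 0.57} \approx 0.64
\]

That is, the reported p_o = 0·847 and κ = 0·57 together imply a chance-agreement rate of roughly 64%.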

Interpretation

LLMs can reproduce clinician review of real-world diagnostic encounters with close agreement. While GPT-5 performed comparably to clinicians for condition mapping, the eligibility filtering step (deciding which encounters should enter the diagnostic-accuracy analysis) remains the main source of divergence and the priority for improvement. Embedding such approaches into health systems could enable automated, continuous performance and safety monitoring and support regulatory compliance. Broader evaluations across diverse care settings are needed to establish generalisability and equity impact.

Funding

German Federal Ministry of Education and Research (NextGenerationEU, PATH project).
