Large language models for automatable real-world performance monitoring of diagnostic decision support systems: a comparison to manual doctor panel review in a prospective clinical study

Abstract

Background

Diagnostic decision support systems (DDSS) are increasingly deployed at scale, yet their diagnostic accuracy is insufficiently monitored once integrated into care. Traditional post-market surveillance relies on clinician review, which is costly, slow, and difficult to sustain. Large language models (LLMs) may offer a scalable and potentially automatable solution, but their performance in real-world monitoring remains unknown.

Methods

We conducted a diagnostic accuracy substudy within ESSENCE, a prospective evaluation of Ada Health's DDSS integrated into Portugal's largest private healthcare network. Clinical notes and ICD-10 diagnoses from 498 encounters were anonymised and classified using a filter–map–match framework. Manual clinician review served as the reference standard. We compared eligibility classification and condition mapping between clinicians and two LLMs, GPT-4.1 and GPT-5, and assessed the diagnostic accuracy of two DDSS versions against both reference sets.
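The abstract does not specify how the filter–map–match framework is implemented, so the sketch below is purely illustrative: it shows one plausible way the three stages could be wired together for top-k accuracy monitoring. All names, the eligibility rule, and the matching rule are assumptions for exposition, not the study's actual pipeline or prompts.

```python
# Illustrative sketch of a filter-map-match pipeline for DDSS accuracy
# monitoring. Every rule below is a stand-in assumption; the study's real
# eligibility criteria, condition mapping, and matching logic are not
# described in the abstract.
from dataclasses import dataclass


@dataclass
class Encounter:
    note: str                    # anonymised clinical note
    icd10_code: str              # clinician-assigned ICD-10 diagnosis
    ddss_suggestions: list[str]  # DDSS condition list, ranked best-first


def filter_eligible(enc: Encounter) -> bool:
    """FILTER: decide whether the encounter can enter the diagnostic
    comparison (e.g. a diagnostic visit with a usable note). In the study
    this step is performed by clinicians or an LLM; this is a trivial stub."""
    return bool(enc.note.strip()) and bool(enc.icd10_code)


def map_to_condition(enc: Encounter) -> str:
    """MAP: translate the ICD-10 code / note text into the DDSS's own
    condition vocabulary. A real pipeline would prompt an LLM or query a
    terminology service; this stub passes the code through unchanged."""
    return enc.icd10_code


def match_top_k(enc: Encounter, condition: str, k: int) -> bool:
    """MATCH: does the mapped condition appear among the DDSS's top-k
    ranked suggestions?"""
    return condition in enc.ddss_suggestions[:k]


def top_k_accuracy(encounters: list[Encounter], k: int) -> float:
    """Aggregate Top-k accuracy over the eligible encounters."""
    eligible = [e for e in encounters if filter_eligible(e)]
    hits = sum(match_top_k(e, map_to_condition(e), k) for e in eligible)
    return hits / len(eligible) if eligible else 0.0
```

Under this reading, the Top-1, Top-3, and Top-5 figures in the Findings correspond to top_k_accuracy with k = 1, 3, and 5, computed with the study's actual filter and map stages rather than these stand-ins.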

Findings

Manual review classified 385 of 498 encounters (77·3%) as eligible for diagnostic comparison. GPT-5 reproduced these classifications with 84·7% accuracy (κ=0·57), showing high sensitivity but only moderate specificity. Among 347 encounters judged eligible by both approaches, GPT-5 exactly matched clinician-assigned diagnoses in 93·6% and proposed clinically plausible alternatives in a further 3·5%. Diagnostic accuracy estimates based on manual versus GPT-5 mappings were statistically indistinguishable at Top-1 and Top-3 across the full analysable sets, with one significant difference at Top-5. In the 346 overlapping cases, no statistically significant differences were observed. Across both reference sets, the experimental DDSS version outperformed the original only at the Top-5 threshold.
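For orientation (this derivation is not in the article): Cohen's κ discounts the agreement expected by chance, p_e, from the observed agreement p_o, which is why κ sits well below the raw 84·7% when one class (here, eligible encounters, 77·3% under manual review) dominates:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
p_e = \frac{p_o - \kappa}{1 - \kappa} \approx \frac{0.847 - 0.57}{1 - 0.57} \approx 0.64
\]

That is, the reported p_o = 0·847 and κ = 0·57 together imply a chance-agreement rate of roughly 64%.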

Interpretation

LLMs can reproduce clinician review of real-world diagnostic encounters with close agreement. While GPT-5 performed comparably to clinicians for condition mapping, the eligibility filtering step (deciding which encounters should enter the diagnostic-accuracy analysis) remains the main source of divergence and the priority for improvement. Embedding such approaches into health systems could enable automated, continuous performance and safety monitoring and support regulatory compliance. Broader evaluations across diverse care settings are needed to establish generalisability and equity impact.

Funding

German Federal Ministry of Education and Research (NextGenerationEU, PATH project).
