Human-AI Collaboration in Clinical Reasoning: A UK Replication and Interaction Analysis
Abstract
Objective
A paper from Goh et al. found that a large language model (LLM) working alone outperformed American clinicians assisted by the same LLM on diagnostic reasoning tests [1]. We aimed to replicate this result in a UK setting and to explore how physicians' interactions with the LLM might explain the observed performance gap.
Methods and Analysis
This was a within-subjects study of 22 UK physicians who answered structured questions on four clinical vignettes. For two of the cases, physicians had access to an LLM via a custom-built web application. Results were analysed using a mixed-effects model accounting for case difficulty and baseline variability between clinicians. Qualitative analysis involved coding participant-LLM interaction logs and evaluating the rate of LLM use per question.
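To illustrate the kind of analysis described above (a sketch only, not the authors' actual code), the specification below uses Python's statsmodels with a hypothetical long-format table of per-question scores. The column names score, llm_assist, clinician_id and case_id are assumptions; here case difficulty is handled as a fixed effect and baseline clinician variability as a random intercept.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per physician-question, with a percentage score,
# an indicator for LLM access, and identifiers for clinician and case.
df = pd.read_csv("question_scores.csv")

# Mixed-effects linear model: LLM access and case difficulty as fixed effects,
# plus a random intercept per clinician to capture baseline variability.
model = smf.mixedlm(
    "score ~ llm_assist + C(case_id)",
    data=df,
    groups="clinician_id",
)
result = model.fit()
print(result.summary())

The published analysis may have specified case as a random effect or used a different estimator; the sketch simply shows how case-level and clinician-level variation can be modelled alongside the effect of LLM assistance.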
Results
Physicians with LLM assistance scored significantly lower than the LLM alone (mean difference 21.3 percentage points, p < 0.001). Access to the LLM was associated with improved physician performance compared with conventional resources alone (73.7% vs 66.3%, p = 0.001). There was significant heterogeneity in the degree of LLM-assisted improvement (SD 10.4%). Qualitative analysis revealed that only 30% of case questions were posed directly to the LLM, suggesting that under-utilisation contributed to the observed performance gap.
Conclusion
While access to an LLM can improve diagnostic accuracy, realising the full potential of human-AI collaboration may require training clinicians to integrate these tools into their cognitive workflows and designing systems that make such integration the default rather than an optional extra.