From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis
Abstract
Early studies of large language models (LLMs) in clinical settings have largely treated artificial intelligence (AI) as a tool rather than an active collaborator. As LLMs now demonstrate expert-level diagnostic performance, the question shifts from whether AI can offer valuable suggestions to how it can be effectively integrated into physicians’ diagnostic workflows. We conducted a randomized controlled trial (n = 70 clinicians) to evaluate a custom GPT system designed to collaborate with clinicians on diagnostic reasoning challenges. The collaborative design began with independent diagnostic assessments from the clinician and the AI, which were then combined in an AI-generated synthesis that integrated the two perspectives, highlighted points of agreement and disagreement, and offered commentary on each. We evaluated two workflow variants: one in which the AI provided an initial opinion (AI-first) and another in which it followed the clinician’s assessment (AI-second). Clinicians using either collaborative workflow outperformed those using traditional resources, achieving average accuracies of 85% (AI-first) and 82% (AI-second) versus 75% with traditional resources (p < 0.0004 and p < 0.00001; mean differences = 9.8% and 6.8%; 95% CIs = 4.6%–15% and 4.0%–9.6%, respectively). Performance did not differ significantly between the two workflows or from the AI-alone score of 90%. These results underscore the value of collaborative AI systems that complement clinician expertise and foster effective coordination between human and machine reasoning in diagnostic decision-making.
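As a rough illustration of the two workflow variants described above, the sketch below outlines how independent clinician and AI assessments could be collected and then merged into an AI-generated synthesis. It is a minimal sketch under stated assumptions, not the authors' implementation: `call_llm` is a hypothetical stand-in for whatever chat-completion interface the custom GPT system uses, and the prompts are illustrative only.

```python
# Minimal sketch of the AI-first and AI-second collaborative workflows.
# Assumption: `call_llm` and `get_clinician_dx` are hypothetical placeholders,
# not part of the study's actual system.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    return f"[LLM response to: {prompt[:60]}...]"


def ai_assessment(case: str) -> str:
    """AI produces an independent diagnostic assessment of the case."""
    return call_llm(f"Give an independent differential diagnosis for this case:\n{case}")


def synthesize(case: str, clinician_dx: str, ai_dx: str) -> str:
    """AI-generated synthesis: compare the two assessments, flag agreement
    and disagreement, and comment on each."""
    return call_llm(
        "Compare these two independent assessments of the same case, "
        "highlight points of agreement and disagreement, and comment on each.\n"
        f"Case: {case}\nClinician assessment: {clinician_dx}\nAI assessment: {ai_dx}"
    )


def ai_first_workflow(case: str, get_clinician_dx) -> str:
    """AI offers its opinion first, then the clinician assesses; both feed the synthesis."""
    ai_dx = ai_assessment(case)
    clinician_dx = get_clinician_dx(case)
    return synthesize(case, clinician_dx, ai_dx)


def ai_second_workflow(case: str, get_clinician_dx) -> str:
    """Clinician commits to an assessment first; the AI opinion follows, then the synthesis."""
    clinician_dx = get_clinician_dx(case)
    ai_dx = ai_assessment(case)
    return synthesize(case, clinician_dx, ai_dx)


if __name__ == "__main__":
    case = "65-year-old with progressive dyspnea and lower-extremity edema."
    print(ai_second_workflow(case, lambda c: "Clinician's working diagnosis (entered independently)."))
```

In either variant, the key design choice is that both parties commit to an independent assessment before the synthesis step, so the comparison can surface genuine agreement and disagreement rather than anchoring one party on the other.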