Comparison of AI-generated radiology impressions: a multi-stakeholder evaluation

Sharang Phadke
Nivedita Suresh
Zachary Allen
Anjali Balagopal
Stephen Chan
Anish Shah
Megan Winter
Cesar Lam
Trevor Rose
Cyrillo Araujo
Abraham Ahmed
Iman Imanirad
Lincoln Berland
Andrew Del Gaizo

Abstract

A retrospective, blinded evaluation of 200 oncologic computed tomography reports compared original radiologist-authored impressions, impressions generated by a custom domain-specific AI model fine-tuned on institutional data, and impressions generated by a general-purpose large language model. Ten clinicians, including original radiologists (n = 4), independent radiologists (n = 3), and oncologists (n = 3), rated impressions for completeness, correctness, conciseness, clarity, clinical utility, and patient harm. Original and independent radiologists assigned lower preference to generic model impressions (Cohen’s h = 1.04–1.22 and 0.66–0.69, p < 0.001). Original radiologists slightly preferred their own impressions to the custom model (h = 0.18, p = 0.0716), while independent radiologists showed no preference (h = −0.03, p = 0.78). Oncologists demonstrated no significant preference among impression types (h = 0.04–0.12, all p > 0.20). Custom model impressions achieved near parity with human impressions; original radiologists rated their own impressions slightly more complete (r = 0.22, p = 0.0016). Generic model impressions were longer (75.1 ± 20.4 words), slightly more complete (r = 0.18–0.39, p < 0.001–0.01), but significantly less concise (r = 0.85–0.87, p < 0.001). Patient harm ratings were uniformly low (likelihood 1.01–1.14; extent 1.05–1.21). Inter-rater reliability ranged from −0.09 to 0.67 (α = 0.67 for conciseness; α = −0.09–0.03 for clinical utility/correctness).
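
The preference results above are reported as Cohen's h, an effect size for the difference between two proportions. As a reading aid, the minimal sketch below computes h from its standard definition, h = 2·arcsin(√p1) − 2·arcsin(√p2). The function name `cohens_h` and the example proportions are illustrative assumptions; they are not the study's data or the authors' analysis code.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between two arcsine-transformed proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Hypothetical proportions (NOT taken from the study): one impression type
# is preferred in 75% of comparisons, the other in 25%.
h = cohens_h(0.75, 0.25)
print(f"h = {h:.3f}")  # prints "h = 1.047" (exactly pi/3)
```

By Cohen's conventional benchmarks, |h| near 0.2 is a small effect, 0.5 medium, and 0.8 large, so values above 1 indicate a strong preference while values near 0 indicate essentially none.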

Version published to 10.1038/s41746-026-02586-6
Apr 4, 2026
Version published to 10.21203/rs.3.rs-8476600/v1 on Research Square
Jan 14, 2026

