Comparison of AI-generated radiology impressions: a multi-stakeholder evaluation

Abstract

A retrospective, blinded evaluation of 200 oncologic computed tomography reports compared three types of impression: original radiologist-authored impressions, impressions generated by a custom domain-specific AI model fine-tuned on institutional data, and impressions generated by a general-purpose large language model. Ten clinicians, comprising original radiologists (n = 4), independent radiologists (n = 3), and oncologists (n = 3), rated the impressions for completeness, correctness, conciseness, clarity, clinical utility, and patient harm. Original and independent radiologists both assigned lower preference to generic model impressions (Cohen’s h = 1.04–1.22 and 0.66–0.69, respectively; p < 0.001). Original radiologists slightly preferred their own impressions to those of the custom model, although the difference was not statistically significant (h = 0.18, p = 0.0716), while independent radiologists showed no preference (h = −0.03, p = 0.78). Oncologists demonstrated no significant preference among impression types (h = 0.04–0.12, all p > 0.20). Custom model impressions achieved near parity with human impressions, although original radiologists rated their own impressions slightly more complete (r = 0.22, p = 0.0016). Generic model impressions were longer (75.1 ± 20.4 words) and slightly more complete (r = 0.18–0.39, p < 0.001–0.01), but significantly less concise (r = 0.85–0.87, p < 0.001). Patient harm ratings were uniformly low (likelihood 1.01–1.14; extent 1.05–1.21). Inter-rater reliability ranged from −0.09 to 0.67 (α = 0.67 for conciseness; α = −0.09 to 0.03 for clinical utility and correctness).
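
For readers unfamiliar with the preference effect sizes reported above, Cohen's h is the difference between two proportions after an arcsine square-root transformation. The following is a minimal sketch of that standard formula only; the function name `cohens_h` and the example proportions are illustrative assumptions, not the authors' analysis code or data.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of two arcsine-transformed proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    # phi = 2 * asin(sqrt(p)) is the variance-stabilising transform
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Hypothetical proportions chosen for illustration, NOT taken from the study:
# one impression type preferred in 75% of comparisons, another in 25%.
h = cohens_h(0.75, 0.25)
print(round(h, 2))  # 1.05
```

By Cohen's conventional benchmarks, |h| of about 0.2 is a small effect, 0.5 medium, and 0.8 large, which is one way to read the range of h values in the abstract.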
