Comparison of AI-generated radiology impressions: a multi-stakeholder evaluation
Abstract
A retrospective, blinded evaluation of 200 oncologic computed tomography reports compared three types of impression: the original radiologist-authored impressions, impressions generated by a custom domain-specific AI model fine-tuned on institutional data, and impressions generated by a general-purpose large language model. Ten clinicians, comprising original radiologists (n = 4), independent radiologists (n = 3), and oncologists (n = 3), rated the impressions for completeness, correctness, conciseness, clarity, clinical utility, and patient harm. Original and independent radiologists assigned lower preference to generic model impressions (Cohen’s h = 1.04–1.22 and 0.66–0.69, respectively; p < 0.001). Original radiologists slightly preferred their own impressions to those of the custom model (h = 0.18, p = 0.0716), while independent radiologists showed no preference (h = −0.03, p = 0.78). Oncologists demonstrated no significant preference among impression types (h = 0.04–0.12, all p > 0.20). Custom model impressions achieved near parity with human impressions, although original radiologists rated their own impressions slightly more complete (r = 0.22, p = 0.0016). Generic model impressions were longer (75.1 ± 20.4 words) and slightly more complete (r = 0.18–0.39, p < 0.001–0.01), but significantly less concise (r = 0.85–0.87, p < 0.001). Patient harm ratings were uniformly low (likelihood 1.01–1.14; extent 1.05–1.21). Inter-rater reliability (α) ranged from −0.09 to 0.67: it was highest for conciseness (α = 0.67) and lowest for clinical utility and correctness (α = −0.09 to 0.03).
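For readers unfamiliar with the preference effect size reported above, Cohen's h compares two proportions after an arcsine square-root transform: h = 2·arcsin(√p1) − 2·arcsin(√p2). The following is a minimal sketch of that standard formula only. The function name `cohens_h` and the example proportions are hypothetical and are not taken from the study, which does not report the underlying preference proportions in the abstract.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Return Cohen's h, the effect size for the difference between two proportions.

    Uses the standard definition h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)).
    The sign indicates direction; the magnitude is what is usually interpreted.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


if __name__ == "__main__":
    # Hypothetical illustration (NOT study data): one impression type preferred
    # in 75% of comparisons versus another preferred in 25%.
    print(round(cohens_h(0.75, 0.25), 3))  # 1.047
    # Equal proportions give no effect.
    print(cohens_h(0.5, 0.5))  # 0.0
```

By Cohen's conventional benchmarks, |h| around 0.2 is considered small, 0.5 medium, and 0.8 large. On that scale the h values reported for generic model impressions among radiologists (0.66 and above) are medium to large, while those near 0.03–0.18 are small or negligible.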