Comparison of AI-Generated Radiology Impressions: A Multi-Stakeholder Evaluation

Abstract

Objective: To evaluate the quality, safety, and clinical utility of AI-generated radiology impressions compared with human-authored impressions across multiple clinical stakeholder groups.

Materials & Methods: A retrospective, blinded evaluation was conducted using 200 oncologic computed-tomography reports from a U.S. academic cancer center. Three impression types were assessed for each report: the original radiologist-authored impression, an impression generated by a custom domain-specific AI model fine-tuned on institutional data, and an impression generated by a general-purpose large language model. Original authoring radiologists, independent radiologists, and oncologists evaluated the impressions using structured Likert-scale metrics assessing completeness, correctness, conciseness, clarity, clinical utility, and potential patient harm. Pairwise comparisons were performed using Wilcoxon signed-rank and two-proportion z-tests.

Results: Custom-model AI impressions demonstrated near parity with human-authored impressions across most quality metrics. Original radiologists rated their own impressions as slightly more complete, while independent radiologists found no significant differences between original and custom-model impressions. Generic-model impressions were longer and were rated as more complete but significantly less concise. Patient-harm ratings were uniformly low. Radiologists preferred original and custom-model impressions over generic-model impressions, whereas oncologists showed no significant preference.

Discussion: Evaluation outcomes varied by stakeholder group, highlighting differing priorities between radiologists and oncologists. Low inter-rater agreement across several quality metrics suggests that impression quality is inherently subjective and context dependent rather than defined by a single objective standard.

Conclusion: AI-generated radiology impressions, particularly those produced by custom domain-specific models, can achieve quality and safety comparable to human-authored impressions. These findings support the use of AI as an adaptable drafting aid that complements radiologist judgment.
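The paired comparison described in the Methods can be sketched as follows. This is a minimal illustration only, not the study's analysis: the Likert ratings below are invented, and only the Wilcoxon signed-rank step is shown (the two-proportion z-test for preference rates would be a separate calculation).

```python
# Hypothetical sketch: comparing paired Likert ratings of two impression
# types for the same reports with a Wilcoxon signed-rank test (SciPy).
# The ratings are fabricated for illustration.
from scipy.stats import wilcoxon

# Illustrative 1-5 "conciseness" ratings for ten reports,
# one rating per report under each impression type.
original_impressions = [5, 4, 5, 4, 5, 5, 4, 5, 4, 5]
generic_model = [3, 4, 3, 3, 4, 3, 4, 3, 3, 4]

# Paired, non-parametric test: appropriate for ordinal Likert data
# where the same report is rated under both conditions.
stat, p = wilcoxon(original_impressions, generic_model)
print(f"W = {stat}, p = {p:.4f}")
```

Because Likert ratings are ordinal and each report contributes one rating per impression type, a paired rank-based test is a natural fit; tied and zero differences are handled by the test's standard conventions.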
