Evaluating Generative AI as an Educational Tool for Radiology Resident Report Drafting
Abstract
Objective
Radiology residents require timely, personalized feedback to develop accurate image analysis and reporting skills, but increasing clinical workload often limits attending radiologists' ability to provide such guidance. This study evaluates a HIPAA-compliant GPT-4o system that delivers automated feedback on breast imaging reports drafted by residents in real clinical settings.
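As a rough illustration of the kind of feedback pipeline described above, the sketch below shows how a resident draft and the attending's final report could be sent to GPT-4o via the OpenAI Python SDK. The prompt wording, function name, and deployment details (HIPAA-compliant hosting, PHI handling) are assumptions for illustration, not the authors' actual system.

```python
# Illustrative sketch only; prompt text and handling of protected health
# information are hypothetical, not the study's implementation.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def draft_feedback(resident_report: str, attending_report: str) -> str:
    """Ask GPT-4o to compare a resident draft with the attending's final
    report and return teaching feedback (hypothetical prompt)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are an attending breast radiologist giving "
                        "educational feedback on a resident's draft report."},
            {"role": "user",
             "content": f"Resident draft:\n{resident_report}\n\n"
                        f"Attending final report:\n{attending_report}\n\n"
                        "List the discrepancies and explain each one "
                        "for the resident."},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```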
Methods
We analyzed 5,000 resident–attending report pairs from routine practice at a multi-site U.S. health system. GPT-4o was prompted with clinical instructions to identify common errors and provide feedback. A reader study was conducted on a subset of 100 report pairs: four attending radiologists and four residents independently reviewed each pair, determined whether predefined error types were present, and rated GPT-4o's feedback as helpful or not. Agreement between GPT-4o and readers was assessed using percent match, inter-reader reliability was measured with Krippendorff's alpha, and educational value was measured as the proportion of cases rated helpful.
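A minimal sketch of the two agreement statistics named above is shown below, using the `krippendorff` Python package; the toy labels and variable names are illustrative assumptions, not the study's analysis code.

```python
# Hedged sketch: percent match and Krippendorff's alpha on toy binary labels.
import numpy as np
import krippendorff  # pip install krippendorff

# Binary error-presence labels for one error type across eight report pairs.
gpt_labels     = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # GPT-4o
attending_cons = np.array([1, 0, 1, 0, 0, 0, 1, 0])  # attending consensus

# Percent match between GPT-4o and the attending consensus.
percent_match = np.mean(gpt_labels == attending_cons) * 100
print(f"percent match: {percent_match:.1f}%")

# Inter-reader reliability: rows are readers, columns are report pairs.
reader_matrix = np.array([
    [1, 0, 1, 0, 0, 0, 1, 0],   # attending 1
    [1, 0, 1, 1, 0, 0, 1, 0],   # attending 2
    [1, 0, 0, 0, 0, 1, 1, 0],   # attending 3
    [1, 0, 1, 0, 0, 0, 1, 1],   # attending 4
], dtype=float)
alpha = krippendorff.alpha(reliability_data=reader_matrix,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```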
Results
Three common error types were identified: (1) omission or addition of key findings, (2) incorrect use or omission of technical descriptors, and (3) final assessment inconsistent with findings. GPT-4o showed strong agreement with attending consensus across the three error types: 90.5%, 78.3%, and 90.4%. Inter-reader reliability showed moderate variability (α = 0.767, 0.595, 0.567), and replacing a human reader with GPT-4o did not significantly affect agreement (Δ = –0.004 to 0.002). GPT-4o's feedback was rated helpful in most cases: 89.8%, 83.0%, and 92.0% across the three error types.
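The reader-substitution check reported above can be sketched as follows: recompute Krippendorff's alpha after swapping each human reader's labels for GPT-4o's and inspect the change. The function name and data layout are assumptions for illustration.

```python
# Hedged sketch of the reader-substitution analysis; not the authors' code.
import numpy as np
import krippendorff

def substitution_deltas(reader_matrix: np.ndarray,
                        gpt_labels: np.ndarray) -> list[float]:
    """Return alpha(one reader replaced by GPT-4o) minus alpha(all human),
    for each human reader in turn."""
    base = krippendorff.alpha(reliability_data=reader_matrix,
                              level_of_measurement="nominal")
    deltas = []
    for i in range(reader_matrix.shape[0]):
        swapped = reader_matrix.copy()
        swapped[i] = gpt_labels          # replace reader i with GPT-4o's labels
        alt = krippendorff.alpha(reliability_data=swapped,
                                 level_of_measurement="nominal")
        deltas.append(alt - base)
    return deltas
```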
Discussion
GPT-4o can reliably identify key educational errors in resident breast imaging reports and may serve as a scalable tool to support radiology education.