Prompt Engineering in Large Language Models for BI-RADS Classification of Imaging Reports: A Retrospective Evaluation

Wenjie Liu
Hailong Wu
Yuanyuan Lang
Yan Luo
Yan Li
Xinyi Liu
Yinping Leng
Lianggeng Gong

Abstract

Purpose: To evaluate how prompt engineering affects the accuracy of large language models (LLMs) in Breast Imaging Reporting and Data System (BI-RADS) classification of digital breast tomosynthesis (DBT) reports.

Materials and Methods: This retrospective study collected reports from 216 patients who underwent DBT for breast cancer screening or diagnosis. Two experts independently assigned BI-RADS classifications to all reports. Three LLMs (GPT-4o, GPT-o3 mini, Qwen-2.5 max) classified all reports using different prompts. In addition, six human readers independently assigned BI-RADS classifications. Agreement between experts and LLMs on BI-RADS categories was evaluated using weighted Cohen's kappa (κw). Friedman and Nemenyi tests assessed differences in κw among the three prompt conditions. The frequencies of changed BI-RADS category assignments that could affect clinical management were also calculated.

Results: With prompt III, GPT-4o achieved near-perfect agreement with experts (κw = 0.80), surpassing GPT-o3 mini (0.76) and Qwen-2.5 max (0.79). Its κw was significantly higher with prompt III than with prompt II (0.69, P < 0.05) and prompt I (0.63, P < 0.01). GPT-4o's κw remained lower than that of two mid-level radiologists (0.89 and 0.86) but exceeded that of two entry-level radiologists (0.76 and 0.79). For changes affecting clinical management, prompt III yielded a 14.8% discordance rate with experts, lower than prompt I (29.6%) and prompt II (28.2%) and comparable to the entry-level radiologists (15.3% and 14.4%).

Conclusion: With optimized prompts, GPT-4o achieved near-perfect agreement with experts and matched the clinical management performance of entry-level radiologists. These findings support the use of LLMs as an auxiliary tool for radiologists performing BI-RADS classification in breast cancer diagnosis.
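The agreement statistic named in the abstract, weighted Cohen's kappa, can be illustrated with a minimal sketch. The abstract does not state the weighting scheme, the category set, or the software used, so the linear weights, the function name `weighted_kappa`, and the toy ratings below are all assumptions made for illustration and are not the study's data or code.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories.

    `categories` lists the labels in ascending order. The weighting scheme
    ("linear" or "quadratic") is an assumption; the abstract does not say
    which one the authors used.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("at least two ordered categories are required")

    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    n = len(rater_a)
    power = 1 if weighting == "linear" else 2

    def weight(i, j):
        # Disagreement weight grows with the distance between categories.
        return (abs(i - j) / (k - 1)) ** power

    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    # Observed weighted disagreement, averaged over all rated reports.
    obs_dis = sum(weight(i, j) * c for (i, j), c in observed.items()) / n
    # Weighted disagreement expected by chance from the marginal totals.
    exp_dis = sum(
        weight(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)

    if exp_dis == 0:
        return float("nan")  # undefined when both raters use one category
    return 1.0 - obs_dis / exp_dis


if __name__ == "__main__":
    # Hypothetical ratings for six reports; not taken from the study.
    expert = ["2", "3", "4", "4", "5", "2"]
    model = ["2", "3", "4", "5", "5", "3"]
    print(weighted_kappa(expert, model, ["1", "2", "3", "4", "5"]))
```

Because the weights scale with the distance between ordered categories, a model that assigns category 3 where the expert assigned 4 is penalized less than one that assigns 2, which is why a weighted statistic suits an ordinal scale such as BI-RADS.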

Version published to 10.21203/rs.3.rs-7526460/v1 on Research Square
Oct 19, 2025

Related articles

Evaluating Large Language Models for Translating Caries Guidelines into Clinical Decision Support

This article has 8 authors:
1. Gu Nan
2. Bingxin Fan
3. Yao Yuan
4. Xinliang Duan
5. Sichen Han
6. Zhenyong Tang
7. Jiayu Shen
8. Zilin Wang
This article has no evaluations. Latest version: Jan 28, 2026

Comparative Accuracy of Large Language Models for CPT Coding Assignments from Surgical Procedure Notes

This article has 4 authors:
1. Abdalrahman Katranji
2. Aisa De Vries
3. Abdalmajid Katranji
4. Mohammad Zalzaleh
This article has no evaluations. Latest version: Jan 8, 2026

Large Language Models in Radiology Exams: A Comparative Analysis of Performance in Turkish and English

This article has 2 authors:
1. Şahinde ATLANOĞLU
2. Mehmet Ali GEDİK
This article has no evaluations. Latest version: Jan 21, 2026
