Prompt Engineering in Large Language Models for BI-RADS Classification of Imaging Reports: A Retrospective Evaluation

Abstract

Purpose: To evaluate how prompt engineering modulates the accuracy of large language models (LLMs) in Breast Imaging Reporting and Data System (BI-RADS) classification of digital breast tomosynthesis (DBT) reports.

Materials and Methods: This retrospective study collected reports from 216 patients who underwent DBT for breast cancer screening or diagnosis. Two experts independently assigned BI-RADS classifications to all reports. Three LLMs (GPT-4o, GPT-o3 mini, Qwen-2.5 max) classified all reports using different prompts. In addition, six human readers independently assigned BI-RADS classifications. Agreement between experts and LLMs on BI-RADS categories was evaluated using weighted Cohen's kappa (κw). Friedman and Nemenyi tests assessed κw differences among the three prompt conditions. The frequencies of changed BI-RADS category assignments that could affect clinical management were also calculated.

Results: With prompt III, GPT-4o achieved near-perfect agreement with experts (κw = 0.80), surpassing GPT-o3 mini (0.76) and Qwen-2.5 max (0.79). GPT-4o's κw was significantly higher with prompt III than with prompt II (0.69, P < 0.05) and prompt I (0.63, P < 0.01). While GPT-4o's κw remained lower than that of two mid-level radiologists (0.89 and 0.86), it exceeded that of two entry-level radiologists (0.76 and 0.79). Regarding clinical management changes, prompt III yielded a 14.8% discordance rate with experts, lower than with prompt I (29.6%) and prompt II (28.2%) and comparable to entry-level radiologists (15.3% and 14.4%).

Conclusion: With optimized prompts, GPT-4o achieved near-perfect agreement and matched the clinical management performance of entry-level radiologists. These findings support the use of LLMs as an auxiliary tool for radiologists in BI-RADS classification for breast cancer diagnosis.
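For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how a weighted Cohen's kappa between two raters can be computed over ordered categories. It is not the authors' code. The function name `weighted_kappa`, the category list, and the example labels are hypothetical, and the abstract does not say whether linear or quadratic weights were used, so both are offered as an assumption.

```python
def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories.

    `categories` lists every possible label in rank order. The weighting
    scheme is an assumption: the abstract does not specify it.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")

    index = {label: i for i, label in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions: observed[i][j] is the share of items
    # that rater A put in category i and rater B put in category j.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Disagreement weight grows with the rank distance between categories.
    power = 1 if weights == "linear" else 2

    def penalty(i, j):
        return (abs(i - j) / (k - 1)) ** power

    observed_dis = sum(
        penalty(i, j) * observed[i][j] for i in range(k) for j in range(k)
    )
    expected_dis = sum(
        penalty(i, j) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    if expected_dis == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed_dis / expected_dis


# Hypothetical labels for illustration only; these are not study data.
CATEGORIES = ["1", "2", "3", "4", "5"]
expert = ["1", "2", "2", "3", "4", "5", "4", "2"]
model = ["1", "2", "3", "3", "4", "4", "4", "2"]
kappa = weighted_kappa(expert, model, CATEGORIES, weights="linear")
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance. Because the penalty scales with rank distance, assigning category 3 where the expert assigned 2 costs less than assigning category 5.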
