Performance of a Large Language Model in BI-RADS Classification of Ultrasound Based Breast Lesions

Kathryn Pillai
Fauzia Nausheen

Abstract

Aims

With the advent of large language models (LLMs), the number of potential applications of artificial intelligence in radiology has rapidly increased. Several recent studies have evaluated how accurately and how well LLMs characterize CT and MRI scans. Yet, to our knowledge, few studies have reported the utility of these models in generating BI-RADS assessment categories.

Methods

A breast ultrasound dataset of 256 images from 256 patients, manually interpreted and labeled by radiologists according to BI-RADS features and lexicon, was used to evaluate Gemini 2.0 Flash. We prompted the model to assess each image in an individual context window and tested it with two variations of the original prompt (n = 3). Statistical analyses were then performed comparing the model's predictions with the ground truth. The area under the receiver operating characteristic curve (ROC-AUC) was calculated for each classification type from individual replicates.
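
As an illustration of what a per-category ROC-AUC involves, the sketch below computes a one-vs-rest, rank-based AUC for each category. It is not the authors' analysis code: the function names, the toy labels, and the choice to score a hard prediction as 1.0 when it equals the category and 0.0 otherwise are all assumptions made for this example.

```python
from typing import Dict, Sequence


def one_vs_rest_auc(y_true: Sequence[str], scores: Sequence[float], positive: str) -> float:
    """Rank-based (Mann-Whitney) ROC-AUC for one category against all others."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count positive/negative pairs where the positive case scores higher;
    # tied pairs count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_category_auc(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """AUC per category, assuming only hard labels are available (no probabilities)."""
    result = {}
    for category in sorted(set(y_true)):
        # Assumption: a hard prediction becomes a 0/1 score for this category.
        scores = [1.0 if p == category else 0.0 for p in y_pred]
        result[category] = one_vs_rest_auc(y_true, scores, category)
    return result


# Hypothetical toy labels, not data from the study.
toy_true = ["2", "3", "4a", "4b", "5", "3"]
toy_pred = ["2", "4a", "4a", "3", "4b", "3"]
toy_auc = per_category_auc(toy_true, toy_pred)
```

With hard labels, a category that the model never predicts correctly or incorrectly more often than chance lands at 0.5, which is why scores near 0.5 are read as random-chance performance.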

Results

We found that the overall accuracy of Gemini 2.0 Flash in predicting the BI-RADS classification of the breast lesions was 19.01%, and the accuracies for individual categories did not differ significantly from one another. In the ROC-AUC analysis, all category scores ranged from 0.5 to 0.6; the model performed slightly better at categorizing benign lesions (1-4a), while its performance on lesions with a greater probability of malignancy (4b-5) was akin to random chance. Furthermore, among incorrect predictions, the model was generally within 1-2 categories of the true classification, a level of precision too low to be reliable for realistic clinical use.
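
The two summary measures in this paragraph, overall accuracy and how far incorrect predictions fall from the true category, can be sketched as follows. This is a minimal example under stated assumptions: the ordinal scale `ORDER` (1, 2, 3, 4a, 4b, 4c, 5), the helper names, and the toy labels are illustrative and are not taken from the study.

```python
from collections import Counter
from typing import Sequence

# Assumed ordinal BI-RADS scale for this example only.
ORDER = ["1", "2", "3", "4a", "4b", "4c", "5"]
RANK = {category: index for index, category in enumerate(ORDER)}


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the radiologist label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def error_distances(y_true: Sequence[str], y_pred: Sequence[str]) -> Counter:
    """Histogram of |predicted rank - true rank| over incorrect predictions only."""
    return Counter(
        abs(RANK[t] - RANK[p]) for t, p in zip(y_true, y_pred) if t != p
    )


# Hypothetical toy labels, not data from the study.
demo_true = ["2", "3", "4a", "4b", "5", "3"]
demo_pred = ["2", "4a", "4a", "3", "4b", "3"]

demo_accuracy = accuracy(demo_true, demo_pred)          # 3 of 6 correct
demo_distances = error_distances(demo_true, demo_pred)  # one miss by 1 step, two by 2
```

Treating the subcategories 4a, 4b, and 4c as single steps on the scale is a design choice; a different spacing would change the reported distances.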

Conclusions

This work highlights the current limitations of artificial intelligence models in classifying clinical images and shows that these technologies require further development before translation into the clinical setting. To our knowledge, this is the first study to report the capabilities of LLMs in performing BI-RADS classification of breast lesions with replicates.

Version published to 10.1101/2025.09.28.25336860 on medRxiv
Oct 1, 2025

