Performance of a Large Language Model in BI-RADS Classification of Ultrasound Based Breast Lesions
Abstract
Aims: With the advent of large language models (LLMs), the number of potential applications of artificial intelligence in radiology has increased rapidly. Several recent studies have evaluated the accuracy and quality of LLMs in characterizing CT and MRI scans. Yet, to our knowledge, few studies have reported the utility of these models in generating BI-RADS assessment categories.
Methods: A breast ultrasound dataset of 256 images from 256 patients, manually interpreted and labeled by radiologists according to BI-RADS features and lexicon, was used to evaluate Gemini 2.0 Flash. We prompted the model to assess each image in an individual context window and tested it with two variations of the original prompt (n = 3). Statistical analyses were then performed comparing the model's predictions with the ground truth. Receiver operating characteristic area under the curve (ROC-AUC) values were then calculated for each classification type from individual replicates.
Results: The overall accuracy of Gemini 2.0 Flash in predicting the BI-RADS classification of the breast lesions was 19.01%, and accuracy did not differ significantly between categories. In the ROC-AUC analysis, scores for all categories ranged from 0.5 to 0.6; the model performed slightly better at categorizing benign lesions (1-4a), whereas its performance on lesions with a greater probability of malignancy (4b-5) was akin to random chance. Furthermore, among incorrect predictions, the model was generally within 1-2 categories of the true classification, a level of precision too low to be reliable for clinical use.
Conclusions: This work highlights the current limitations of artificial intelligence models in classifying clinical images; further development of these technologies is required before translation into the clinical setting.
To our knowledge, this is the first study to report the capabilities of LLMs in performing BI-RADS classification of breast lesions with replicates.
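The three quantities reported in the abstract (overall accuracy, per-category ROC-AUC, and how far incorrect predictions fall from the true category) can be illustrated with a minimal sketch. This is not the authors' analysis code. The category ordering, the function names, and the toy labels are assumptions made for illustration, and the one-vs-rest AUC here is computed from hard category predictions, which the abstract does not confirm as the study's scoring method.

```python
from collections import Counter

# Assumed ordering of BI-RADS assessment categories (hypothetical; the
# abstract mentions categories 1 through 5 with 4a/4b subdivisions).
CATEGORIES = ["1", "2", "3", "4a", "4b", "4c", "5"]
RANK = {c: i for i, c in enumerate(CATEGORIES)}


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def error_distances(y_true, y_pred):
    """Count incorrect predictions by how many categories off they are."""
    return Counter(
        abs(RANK[t] - RANK[p]) for t, p in zip(y_true, y_pred) if t != p
    )


def one_vs_rest_auc(y_true, y_pred, category):
    """ROC-AUC for one category against all others, using hard predictions.

    The score is 1 if the model predicted `category`, else 0. AUC is the
    probability that a positive case outscores a negative one, with ties
    counted as one half (the Mann-Whitney formulation).
    """
    pos = [int(p == category) for t, p in zip(y_true, y_pred) if t == category]
    neg = [int(p == category) for t, p in zip(y_true, y_pred) if t != category]
    if not pos or not neg:
        return None  # undefined unless both classes are present
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy labels for illustration only; these are not the study's data.
y_true = ["2", "3", "4a", "4b", "5", "3"]
y_pred = ["2", "4a", "4a", "5", "4b", "3"]

print(accuracy(y_true, y_pred))
print(error_distances(y_true, y_pred))
print(one_vs_rest_auc(y_true, y_pred, "3"))
```

With hard predictions, the one-vs-rest AUC reduces to the mean of sensitivity and specificity for that category, so a value near 0.5 means the model's assignments for that category carry almost no discriminative signal.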