Large language models in radiologic numerical tasks: A thorough evaluation and error analysis
Abstract
Purpose
To investigate the performance of large language models (LLMs) on radiology numerical tasks and to perform a comprehensive error analysis.
Materials and Methods
We defined six tasks: three extraction tasks, (1) the minimum T-score from a DEXA report, (2) the maximum common bile duct (CBD) diameter from an ultrasound report, and (3) the maximum lung nodule size from a CT report; and three judgement tasks, (1) whether a PET report describes a highly hypermetabolic region, (2) whether a patient is osteoporotic based on a DEXA report, and (3) whether a patient has a dilated CBD based on an ultrasound report. Reports were extracted from the MIMIC-III database and our institution's database, and ground truths were established manually. The models evaluated were Llama 3.1 8B, DeepSeek-R1-Distill-Llama-8B, OpenAI o1-mini, and OpenAI GPT-5-mini. We manually reviewed all incorrect outputs and performed a comprehensive error analysis.
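As an illustration of how an extraction task of this kind can be posed to a chat-style model, the minimal Python sketch below queries a model for the minimum T-score in a DEXA report. The prompt wording, model name, and helper function are hypothetical assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch of one extraction task (hypothetical prompt and helper;
# the study's actual prompts and pipeline are not reproduced here).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_min_t_score(report_text: str) -> str:
    """Ask a chat model for the minimum T-score in a DEXA report."""
    response = client.chat.completions.create(
        model="gpt-5-mini",  # placeholder; substitute any evaluated model
        messages=[
            {"role": "system",
             "content": "You extract numerical values from radiology reports."},
            {"role": "user",
             "content": f"What is the minimum T-score in this DEXA report?\n\n{report_text}"},
        ],
    )
    return response.choices[0].message.content


# Example usage with a synthetic report snippet:
print(extract_min_t_score("Lumbar spine T-score: -2.6. Femoral neck T-score: -1.8."))
```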
Results
In the extraction tasks, Llama showed relatively variable results across tasks (accuracies ranging from 86% to 98.7%), while the other models performed consistently well (accuracies >95%). In the judgement tasks, the lowest accuracies of Llama, DeepSeek, o1-mini, and GPT-5-mini were 62.0%, 91.7%, 91.7%, and 99.0%, respectively, and both o1-mini and GPT-5-mini reached 100% accuracy in detecting osteoporosis. We found no mathematical errors in the outputs of o1-mini or GPT-5-mini. An answer-only output format significantly reduced the performance of Llama and DeepSeek but not of o1-mini or GPT-5-mini.
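To make the output-format comparison concrete, the two prompt styles might look like the following sketch; the exact wording is a hypothetical illustration, not the study's prompts.

```python
# Hypothetical illustration of the two output formats compared in the study;
# the authors' actual prompt wording is not reproduced here.
REPORT = "Lumbar spine T-score: -2.6. Femoral neck T-score: -1.8."

# Free-form format: the model may reason in its output before answering.
freeform_prompt = (
    "Is this patient osteoporotic (minimum T-score <= -2.5)? "
    "Explain your reasoning, then answer yes or no.\n\n" + REPORT
)

# Answer-only format: the model must reply with the answer alone, which
# suppresses visible intermediate reasoning in non-reasoning models.
answer_only_prompt = (
    "Is this patient osteoporotic (minimum T-score <= -2.5)? "
    "Reply with exactly one word: yes or no.\n\n" + REPORT
)
```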
Conclusion
True reasoning models perform consistently well on radiology numerical tasks and make no mathematical errors. Simpler models without true reasoning may also achieve acceptable performance, depending on the task.