Automated ACR TI-RADS Classification of Thyroid Nodules from Narrative Ultrasound Reports Using a Fine-Tuned Open-Source Language Model: A Reproducible and Low-Resource Framework

Miao Yu
Sijia Huang
Muyang Li
Likuan Zhang
Heng Zhang
Qiao Xu
Zikang Wang
Jian Gao

Abstract

Background: Manual ACR TI-RADS classification from narrative ultrasound reports is a key component of thyroid nodule risk stratification, but it is laborious and subject to inter-observer variability. Large Language Models (LLMs) offer a potential solution, yet existing approaches often rely on proprietary models or require extensive computational resources, which limits widespread adoption. This study aimed to develop and validate a reproducible, low-resource framework that uses a fine-tuned open-source LLM to automate this task.

Methods: This retrospective study used a dataset of 1,850 de-identified thyroid ultrasound reports from a single primary center. Radiologists annotated the reports to establish the ground truth. An open-source 7-billion-parameter model (Qwen1.5-7B) was fine-tuned on a training set (n=1,480) using Low-Rank Adaptation (LoRA) on a single consumer-grade GPU. Performance was evaluated on a hold-out internal test set (n=370) and on a separate external validation set (n=210) from another institution.

Results: On the internal test set, the fine-tuned model achieved an overall accuracy of 93.0% and a macro-averaged F1-score of 0.950. On the external validation set, it reached an accuracy of 88.6% and a macro-averaged F1-score of 0.891, indicating that performance generalized to reports from a second institution. It significantly outperformed both a zero-shot LLM baseline and a traditional machine learning model (TF-IDF with SVM) on both datasets.

Conclusions: Fine-tuning an accessible, open-source language model on local, consumer-grade hardware is an effective and resource-efficient strategy for automating ACR TI-RADS classification from narrative reports. This approach offers a practical and generalizable blueprint for healthcare institutions to develop bespoke AI tools, potentially enhancing workflow efficiency and diagnostic consistency while preserving data privacy.
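The abstract reports both overall accuracy and a macro-averaged F1-score. As a minimal sketch of how these two metrics differ, the following Python computes both from paired label lists. The function names and the toy labels ("TR2" to "TR5") are hypothetical illustrations, not the study's code or data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of reports whose predicted category equals the reference label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1_scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); guard against an empty denominator
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels, for illustration only (not data from the study)
y_true = ["TR2", "TR3", "TR3", "TR4", "TR4", "TR4", "TR5", "TR5"]
y_pred = ["TR2", "TR3", "TR4", "TR4", "TR4", "TR4", "TR5", "TR3"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because macro averaging weights every category equally, a rare category that is classified poorly lowers the macro F1 more than it lowers accuracy, which is why the two figures are usually reported together for imbalanced label sets.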

Version published to 10.21203/rs.3.rs-7864505/v1 on Research Square
Nov 4, 2025

Related articles

Can Large Language Models Reliably Interpret Radiology Reports? A Systematic Evaluation for Tumor Progression Classification

This article has 8 authors:
1. Valentin POHYER
2. Constance de Margerie-Mellon
3. Laetitia PERRONNE
4. Loïc DURON
5. Constance THIBAULT
6. Stéphane Oudard
7. Laure FOURNIER
8. Bastien Rance
This article has no evaluations. Latest version: Sep 23, 2025

Large language models in radiologic numerical tasks: A thorough evaluation and error analysis

This article has 6 authors:
1. Ali Nowroozi
2. Masha Bondarenko
3. Adrian Serapio
4. Tician Schnitzler
5. Sukhmanjit S Brar
6. Jae Ho Sohn
This article has no evaluations. Latest version: Oct 21, 2025

Prompt Engineering in Large Language Models for BI-RADS Classification of Imaging Reports: A Retrospective Evaluation

This article has 8 authors:
1. Wenjie Liu
2. Hailong Wu
3. Yuanyuan Lang
4. Yan Luo
5. Yan Li
6. Xinyi Liu
7. Yinping Leng
8. Lianggeng Gong
This article has no evaluations. Latest version: Oct 19, 2025
