Can Large Language Models Reliably Interpret Radiology Reports? A Systematic Evaluation for Tumor Progression Classification
Abstract
Radiology reports, typically recorded as unstructured free text or with varying levels of structure, contain critical information on tumor evolution but remain difficult to mine for care optimization or research without advanced language processing. We evaluated 15 open-source Large Language Models (LLMs) for classifying tumor evolution from French imaging reports, using a gold-standard corpus of 310 cases. We tested models across varied architectures, hyperparameter configurations, and prompting strategies, and compared them with rule-based and BERT-based baselines. We also systematically assessed development time and carbon emissions. Properly selected and configured, LLMs outperformed state-of-the-art baselines without requiring large manually annotated datasets, but consumed substantial computational resources. In contrast, fine-tuned BERT models, trained on high-quality annotations, achieved only slightly lower performance at reduced hardware and computational cost. Our results highlight a trade-off between human annotation effort and computational infrastructure, offering guidance for transforming unstructured clinical reports into structured, actionable data.
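
To make the setup concrete, the sketch below illustrates the kind of zero-shot, prompt-based classification the abstract describes, assuming the Hugging Face transformers library; the model name, label set, and French prompt wording are illustrative assumptions, not the paper's actual protocol.

# Not from the paper's materials: a minimal sketch of prompt-based tumor
# evolution classification, assuming the Hugging Face `transformers`
# library. Model name, labels, and prompt are illustrative assumptions.
from transformers import pipeline

LABELS = ["progression", "stable", "réponse"]  # hypothetical label set

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed open-source LLM
)

def classify_report(report_text: str) -> str:
    # Zero-shot prompt asking for a single-word label, written in French
    # since the corpus consists of French imaging reports.
    prompt = (
        "Classe l'évolution tumorale décrite dans ce compte rendu "
        f"parmi : {', '.join(LABELS)}.\n"
        f"Compte rendu : {report_text}\n"
        "Réponds par un seul mot."
    )
    completion = generator(
        prompt, max_new_tokens=8, return_full_text=False
    )[0]["generated_text"].strip().lower()
    # Map the free-form output back onto the label set; fall back to
    # 'stable' when the model answers off-label (an arbitrary choice).
    return next((label for label in LABELS if label in completion), "stable")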