Improving CXR Report Labeling Through LLM Fine-Tuning and Human Feedback
Abstract
Automated labeling of radiological findings from free-text chest X-ray reports is a critical task for enabling large-scale clinical research and developing artificial intelligence applications in medical imaging. However, manual annotation is prohibitively expensive and time-consuming. Existing automated methods, including rule-based systems, traditional machine learning, fine-tuned smaller language models like BERT, and approaches leveraging large language models (LLMs) primarily for pseudo-label generation, face limitations in accurately capturing the complex nuances, negations, and uncertainties present in clinical narratives. Direct inference using powerful proprietary LLMs via API is often computationally expensive for large datasets. To address these challenges, we propose CheX-LLM, a novel approach that directly fine-tunes an open-source large language model as an end-to-end inference model for chest X-ray report labeling. Our method employs a two-stage training strategy: Supervised Fine-Tuning to adapt the base LLM to the task and output format, followed by Reinforcement Learning from Human Feedback (RLHF) to align the model's generated structured labels with expert radiological judgments. We evaluate CheX-LLM on the benchmark MIMIC-500 dataset and compare its performance against state-of-the-art methods, including CheXpert Labeler, CheXbert, CheX-GPT, and GPT-4. Quantitative results demonstrate that CheX-LLM achieves a state-of-the-art Macro F1 score of 0.9115, surpassing all baselines. Furthermore, a blinded human evaluation by board-certified radiologists confirms that CheX-LLM produces outputs with significantly fewer errors and a higher percentage of perfect reports. Analysis across individual findings, certainty levels, and report lengths reveals that CheX-LLM particularly excels at handling complex descriptions, negation, and uncertainty, and exhibits greater robustness. 
Our work demonstrates the potential of training LLMs directly for structured medical text extraction tasks, offering a promising avenue for more accurate and reliable automated report labeling.
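To make the task concrete, the structured output of such a report labeler can be sketched as follows. This is a minimal, hypothetical example assuming a CheXpert-style label schema (14 findings, each positive, negative, uncertain, or not mentioned); the paper's exact output format and finding names are assumptions here, not taken from the abstract.

```python
import json

# CheXpert-style finding names (the standard 14-label schema; this paper's
# exact schema is an assumption for illustration).
FINDINGS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax", "Support Devices",
]

# Certainty levels a labeler must distinguish in clinical narratives.
ALLOWED = {"positive", "negative", "uncertain", "not mentioned"}

def parse_labels(generated: str) -> dict:
    """Parse and validate a JSON label object emitted by the model,
    filling unmentioned findings with 'not mentioned'."""
    labels = json.loads(generated)
    out = {}
    for finding in FINDINGS:
        value = labels.get(finding, "not mentioned")
        if value not in ALLOWED:
            raise ValueError(f"invalid label {value!r} for {finding}")
        out[finding] = value
    return out

# Hypothetical model output for a short report that rules out effusion
# and hedges on heart size ("the cardiac silhouette may be enlarged").
example = '{"Pleural Effusion": "negative", "Cardiomegaly": "uncertain"}'
parsed = parse_labels(example)
```

The negation ("negative") and uncertainty ("uncertain") labels in the example are exactly the cases the abstract highlights as hardest for rule-based and smaller fine-tuned labelers.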