Improving CXR Report Labeling Through LLM Fine-Tuning and Human Feedback
Abstract
Automated labeling of radiological findings from free-text chest X-ray reports is a critical task for enabling large-scale clinical research and developing artificial intelligence applications in medical imaging. However, manual annotation is prohibitively expensive and time-consuming. Existing automated methods, including rule-based systems, traditional machine learning, fine-tuned smaller language models like BERT, and approaches leveraging large language models (LLMs) primarily for pseudo-label generation, face limitations in accurately capturing the complex nuances, negations, and uncertainties present in clinical narratives. Direct inference using powerful proprietary LLMs via API is often computationally expensive for large datasets. To address these challenges, we propose CheX-LLM, a novel approach that directly fine-tunes an open-source large language model as an end-to-end inference model for chest X-ray report labeling. Our method employs a two-stage training strategy: Supervised Fine-Tuning to adapt the base LLM to the task and output format, followed by Reinforcement Learning from Human Feedback (RLHF) to align the model's generated structured labels with expert radiological judgments. We evaluate CheX-LLM on the benchmark MIMIC-500 dataset and compare its performance against state-of-the-art methods, including CheXpert Labeler, CheXbert, CheX-GPT, and GPT-4. Quantitative results demonstrate that CheX-LLM achieves a state-of-the-art Macro F1 score of 0.9115, surpassing all baselines. Furthermore, a blinded human evaluation by board-certified radiologists confirms that CheX-LLM produces outputs with significantly fewer errors and a higher percentage of perfect reports. Analysis across individual findings, certainty levels, and report lengths reveals that CheX-LLM particularly excels at handling complex descriptions, negation, and uncertainty, and exhibits greater robustness. 
Our work demonstrates the potential of training LLMs directly for structured medical text extraction tasks, offering a promising avenue for more accurate and reliable automated report labeling.
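To make the task concrete, the structured output of such a report labeler can be sketched as follows. This is a minimal, hypothetical example assuming a CheXpert-style label schema (14 findings, each positive, negative, uncertain, or not mentioned); the paper's exact output format and finding names are assumptions here, not taken from the abstract.

```python
import json

# CheXpert-style finding names (the standard 14-label schema; this paper's
# exact schema is an assumption for illustration).
FINDINGS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax", "Support Devices",
]

# Certainty levels a labeler must distinguish in clinical narratives.
ALLOWED = {"positive", "negative", "uncertain", "not mentioned"}

def parse_labels(generated: str) -> dict:
    """Parse and validate a JSON label object emitted by the model,
    filling unmentioned findings with 'not mentioned'."""
    labels = json.loads(generated)
    out = {}
    for finding in FINDINGS:
        value = labels.get(finding, "not mentioned")
        if value not in ALLOWED:
            raise ValueError(f"invalid label {value!r} for {finding}")
        out[finding] = value
    return out

# Hypothetical model output for a short report that rules out effusion
# and hedges on heart size ("the cardiac silhouette may be enlarged").
example = '{"Pleural Effusion": "negative", "Cardiomegaly": "uncertain"}'
parsed = parse_labels(example)
```

The negation ("negative") and uncertainty ("uncertain") labels in the example are exactly the cases the abstract highlights as hardest for rule-based and smaller fine-tuned labelers.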