Development of a RAG-based Expert LLM for Clinical Support in Radiation Oncology
Abstract
The ability of pre-trained large language models (LLMs) to rapidly master novel natural language processing tasks holds transformative potential. However, pre-trained LLMs often struggle to achieve high performance in specialized domains such as oncology and tend to deliver incorrect information confidently (“hallucinate”), limiting their utility in such contexts. Retrieval-augmented generation (RAG) addresses this limitation by dynamically incorporating authoritative, domain-specific knowledge directly into the LLM’s inference process, significantly enhancing performance without the extensive fine-tuning or retraining such specialization typically requires.
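The RAG mechanism described above can be sketched minimally: retrieve the knowledge-base passages most relevant to a query, then prepend them to the prompt so the model answers from authoritative text rather than parametric memory. The toy word-overlap retriever below is an illustrative assumption; production pipelines use dense embedding similarity and an actual LLM call.

```python
# Minimal sketch of a RAG pipeline. The word-overlap retriever and the
# sample knowledge base are illustrative assumptions, not the paper's method;
# real systems retrieve with dense embeddings and pass the prompt to an LLM.

def retrieve(query, knowledge_base, k=2):
    """Rank knowledge-base passages by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, passages):
    """Prepend retrieved context so the model answers from it, not memory."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )

kb = [
    "Prostate cancer is commonly treated with external beam radiotherapy.",
    "NCCN guidelines recommend risk stratification before treatment.",
    "Unrelated passage about clinic scheduling.",
]
query = "What do NCCN guidelines recommend before treatment?"
prompt = build_prompt(query, retrieve(query, kb))
```

The prompt produced here would then be sent to the generator model; grounding the answer in retrieved text is what suppresses hallucination.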
In this study, we demonstrate the strong performance of a minimalist RAG pipeline (without additional model fine-tuning) on radiation oncology board-style examinations. Leveraging a curated knowledge base sourced from Gunderson & Tepper’s Clinical Radiation Oncology, Fifth Edition and the NCCN guidelines, our model substantially surpassed contemporary OpenAI models, achieving 91.5% accuracy on the 2021 American College of Radiology (ACR) TXIT examination. This result markedly exceeds the benchmarks set by previous LLM-based approaches in this field, which attained a maximum accuracy of 74%.
Crucially, our model exhibited robust self-awareness of its knowledge boundaries, addressing a key weakness of pre-trained LLMs: questions answered incorrectly were reliably flagged with low confidence scores (mean 4.12/10 vs. 7.36/10 for correct answers), highlighting areas inadequately represented in the RAG knowledge base. This uncertainty estimation underscores RAG’s strength in enhancing not only accuracy but also the reliability and interpretability of model outputs.
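Because incorrect answers cluster at low self-reported confidence, a simple threshold can route uncertain answers to human review. The sketch below assumes a 1–10 confidence scale as in the abstract; the threshold value of 6.0 is a hypothetical choice for illustration, not a figure from the study.

```python
# Hedged sketch: triage model answers by self-reported confidence (1-10).
# The 6.0 threshold is an illustrative assumption; in practice it would be
# tuned on a validation set against observed accuracy per confidence bin.

def triage(answers, threshold=6.0):
    """Split (answer, confidence) pairs into accepted vs. flagged-for-review."""
    accepted = [a for a, c in answers if c >= threshold]
    flagged = [a for a, c in answers if c < threshold]
    return accepted, flagged

answers = [("Answer A", 8.5), ("Answer B", 4.0), ("Answer C", 7.0)]
accepted, flagged = triage(answers)
# accepted == ["Answer A", "Answer C"]; flagged == ["Answer B"]
```

Flagged items would be deferred to a clinician, turning the confidence signal into a safety mechanism rather than just a score.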
We demonstrate that integrating domain-specific knowledge via RAG significantly enhances large language model performance in radiation oncology, enabling reliable confidence scoring previously unattainable with pre-trained LLMs. This scalable approach may be well-suited for clinical decision support and medical education. Future efforts will incorporate clinical guidelines and select primary literature to broaden applicability.