A Comparative Performance Study of Retrieval-Augmented Generation Systems in Gynecologic Oncology

Abstract

Large language models (LLMs) show great potential in oncology, but their utility is limited by hallucinations and static knowledge. Retrieval-augmented generation (RAG), which grounds model outputs in curated clinical sources, can mitigate these issues. However, systematic head-to-head evaluations of different RAG variants on oncology tasks are lacking. We compared thirteen RAG architectures against a non-RAG baseline, using a single LLM (DeepSeek-R1-0528), one embedding model (BAAI/bge-m3), and a knowledge base derived from gynecologic oncology guidelines and textbooks. System performance was evaluated on two question sets: one focused on cervical cancer management, the other on gynecologic oncology surgery. Each response was independently graded by gynecologic oncology experts. Across both domains, RAG systems generally outperformed the non-RAG baseline. The largest gains were observed on complex surgical questions, whereas improvements on guideline-based cervical cancer management questions were modest and heterogeneous. These findings underscore the domain-dependent value of RAG and the need for rigorous benchmarking before clinical deployment.
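For readers unfamiliar with the basic pattern evaluated here, the following is a minimal Python sketch of a single dense-retrieval RAG step built from the components the abstract names (BAAI/bge-m3 for embeddings, DeepSeek-R1 behind an OpenAI-compatible endpoint). The corpus snippets, prompt wording, API endpoint, and model identifier are illustrative assumptions, not the authors' implementation, and the thirteen architectures compared in the study would each elaborate on this skeleton differently.

    # Minimal dense-retrieval RAG sketch (illustrative; not the authors' code).
    # Assumes `pip install FlagEmbedding openai numpy` and an OpenAI-compatible
    # DeepSeek endpoint; corpus passages and the prompt are placeholders.
    import numpy as np
    from FlagEmbedding import BGEM3FlagModel
    from openai import OpenAI

    # Toy knowledge base standing in for guideline/textbook passages.
    corpus = [
        "Placeholder guideline passage on cervical cancer management.",
        "Placeholder textbook passage on gynecologic oncology surgery.",
    ]

    embedder = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
    corpus_vecs = embedder.encode(corpus)["dense_vecs"]  # one vector per passage

    def retrieve(question: str, k: int = 1) -> list[str]:
        """Return the top-k corpus passages by dense-vector similarity."""
        q_vec = embedder.encode([question])["dense_vecs"][0]
        scores = corpus_vecs @ q_vec  # vectors are normalized, so dot = cosine
        return [corpus[i] for i in np.argsort(-scores)[:k]]

    def answer(question: str) -> str:
        """Ground the LLM's answer in the retrieved passages."""
        context = "\n\n".join(retrieve(question))
        # Endpoint and model name are assumptions for DeepSeek-R1 access.
        client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")
        resp = client.chat.completions.create(
            model="deepseek-reasoner",
            messages=[{
                "role": "user",
                "content": f"Answer using only this context:\n{context}\n\n"
                           f"Question: {question}",
            }],
        )
        return resp.choices[0].message.content

The non-RAG baseline would correspond to calling the model with the question alone, i.e. skipping the retrieve step and omitting the context from the prompt.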
