Systematic Evaluation of Multilingual Retrieval-Augmented Generation for Gastrointestinal Tumor Board Decision Support


Abstract

Large language models (LLMs) have been proposed as decision support tools for multidisciplinary tumor boards, yet systematic preclinical validation of retrieval-augmented generation (RAG) pipelines remains lacking. In this retrospective framework validation study using real-world clinical data, we applied a modular evaluation framework to 100 gastrointestinal tumor board cases spanning five cancer types, systematically testing 16 configurations varying model variant, multilingual retrieval strategy, query formulation, and corpus scope. Baseline concordance with multidisciplinary team recommendations ranged from 79–85%. Combining query rewriting with curated guideline retrieval improved concordance to 93–95% (p < 0.01), with prompt design and corpus curation exerting greater influence than model selection. Among residual discordant cases in optimal configurations, approximately 60% represented clinically inappropriate recommendations rather than acceptable therapeutic alternatives. These findings demonstrate that systematic RAG optimization substantially improves clinical decision support concordance, while the high rate of inappropriate residual errors underscores the necessity of mandatory expert oversight before any clinical deployment.
