Early economic evaluation of retrieval-layer correction in clinical RAG: a decision-uncertainty framework
Abstract
Background: Embedding geometry degradation is common in clinical retrieval-augmented generation (RAG) systems and clearly reduces retrieval accuracy. Corpus-only ZCA whitening is a no-retraining correction that improves retrieval accuracy on diverse clinical text, but its cost-effectiveness depends on whether these improvements lead to better clinical outcomes, a connection that has not yet been empirically confirmed in RAG settings.

Objective: To quantify the conditions under which a low-cost retrieval-layer intervention could be economically viable, and to identify the empirical parameters whose measurement would most reduce decision uncertainty.

Methods: An exploratory decision model with explicit structural gating was developed from a healthcare system perspective (Norwegian reference case, 4% discount rate, 5-year horizon). Whitening effectiveness was modeled across two corpus branches: beneficial on heterogeneous corpora (base-case ΔMRR = +0.221) and harmful on homogeneous corpora (ΔMRR = −0.05). The surrogate link from retrieval improvement to diagnostic accuracy (α) was empirically estimated from the DiReCT dataset (MIMIC-IV-Ext-DiReCT, NeurIPS 2024), which comprises 511 physician-annotated clinical notes from MIMIC-IV. ZCA whitening was applied to ClinicalBERT embeddings, and the change in primary discharge diagnosis (PDD) retrieval accuracy was measured. The primary outputs are scale-independent: the minimum annual query volume (N*) for cost-effectiveness, and outcomes per 1,000 queries.

Results: A DiReCT-based retrieval experiment yielded an empirical α = 1.111 (95% CI [1.014, 2.541]; ClinicalBERT, PDD-level) in a diagnosis-label retrieval setting. This estimate replaces the transported Tao et al. CDSS estimate (α = 0.36) as the base case; the Tao et al. estimate is retained as the conservative scenario. The experiment used the 343 MIMIC-IV clinical notes with sufficient text content, out of the 511 annotated notes in the full DiReCT dataset.
The minimum N* for whitening to cover its €800 implementation cost is 6 annual queries at the base-case parameters and 18 at the conservative α = 0.36; both thresholds are low compared to typical institutional deployment scales. Per 1,000 annual queries, whitening prevents 4.74 adverse diagnostic events (base case) or 1.53 (conservative), corresponding to €253,008 or €81,983 in healthcare savings over 5 years, respectively. These estimates depend on whether improvements in diagnosis-label retrieval accuracy translate into actual clinician diagnostic performance, a structural assumption that the DiReCT experiment does not itself address.

Conclusions: Within this framework, whitening appears economically plausible across the modelled cost structure. The DiReCT experiment provides an empirical α estimate in a clinical-note retrieval task with diagnosis-label relevance, substantially above the previously transported CDSS estimate (α = 0.36), which is retained as the conservative scenario. Resolving the remaining structural uncertainty, namely whether diagnosis-label retrieval translates to clinician diagnostic performance, would require a case-level linkage study with adequate causal identification.
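
As an illustration of the retrieval-layer correction discussed in the abstract, the following is a minimal sketch of corpus-only ZCA whitening followed by a ΔMRR measurement. It is not the authors' pipeline: the function names, the regularization constant `eps`, the use of cosine similarity, the assumption of exactly one relevant document per query, and the synthetic embeddings are all hypothetical choices made for the example. In the study itself the embeddings come from ClinicalBERT and relevance is defined by diagnosis labels.

```python
import numpy as np

def fit_zca(corpus_emb, eps=1e-5):
    """Fit ZCA whitening from corpus embeddings only (rows = documents)."""
    mean = corpus_emb.mean(axis=0)
    centered = corpus_emb - mean
    cov = centered.T @ centered / (len(corpus_emb) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric
    # ZCA: rotate into the eigenbasis, rescale, rotate back.
    # eps (assumed value) guards against near-zero eigenvalues.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W

def apply_zca(emb, mean, W):
    """Apply the corpus-fitted transform to any embeddings (corpus or queries)."""
    return (emb - mean) @ W

def mean_reciprocal_rank(query_emb, corpus_emb, relevant_idx):
    """MRR under cosine similarity, assuming one relevant document per query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ d.T
    target = sims[np.arange(len(q)), relevant_idx]
    # Rank = 1 + number of documents scoring strictly higher than the target.
    ranks = 1 + (sims > target[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))

# Synthetic, anisotropic embeddings with a shared offset (hypothetical data).
rng = np.random.default_rng(0)
corpus = rng.normal(size=(500, 8)) * np.linspace(0.5, 3.0, 8) + 2.0
queries = corpus[:50] + 0.1 * rng.normal(size=(50, 8))
relevant = np.arange(50)

mean, W = fit_zca(corpus)
corpus_white = apply_zca(corpus, mean, W)
queries_white = apply_zca(queries, mean, W)

mrr_raw = mean_reciprocal_rank(queries, corpus, relevant)
mrr_white = mean_reciprocal_rank(queries_white, corpus_white, relevant)
delta_mrr = mrr_white - mrr_raw
print(f"MRR raw={mrr_raw:.3f} whitened={mrr_white:.3f} delta={delta_mrr:+.3f}")
```

The transform is fitted on the corpus alone and then applied unchanged to the queries, which is what makes the correction "corpus-only" and free of retraining. The sign and size of `delta_mrr` on synthetic data carry no clinical meaning; the abstract's two branches (beneficial on heterogeneous corpora, harmful on homogeneous ones) indicate that the effect has to be measured on the target corpus.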