Early economic evaluation of retrieval-layer correction in clinical RAG: a decision-uncertainty framework

Yngve Mikkelsen

Abstract

Background Embedding geometry degradation is common in clinical retrieval-augmented generation (RAG) systems and reduces retrieval accuracy. Corpus-only ZCA whitening is a correction that requires no retraining and improves retrieval accuracy on diverse clinical text. Its cost-effectiveness, however, depends on whether these improvements lead to better clinical outcomes, a connection that has not yet been empirically confirmed in RAG settings.

Objective To quantify the conditions under which a low-cost retrieval-layer intervention could be economically viable, and to identify the empirical parameters whose measurement would most reduce decision uncertainty.

Methods An exploratory decision model with explicit structural gating was developed from a healthcare system perspective (Norwegian reference case, 4% discount rate, 5-year horizon). Whitening effectiveness was modeled across two corpus branches: beneficial on heterogeneous corpora (base-case ΔMRR = +0.221) and harmful on homogeneous corpora (ΔMRR = −0.05). The surrogate link from retrieval improvement to diagnostic accuracy (α) was empirically estimated from the DiReCT dataset (MIMIC-IV-Ext-DiReCT, NeurIPS 2024), which comprises 511 physician-annotated clinical notes from MIMIC-IV. ZCA whitening was applied to ClinicalBERT embeddings, and the change in primary discharge diagnosis (PDD) retrieval accuracy was measured. The primary outputs are scale-independent: the minimum annual query volume (N*) for cost-effectiveness, and outcomes per 1,000 queries.

Results A DiReCT-based retrieval experiment estimated an empirical α = 1.111 (95% CI [1.014, 2.541]; ClinicalBERT, PDD-level) in a diagnosis-label retrieval setting. This value replaces the transported Tao et al. CDSS estimate (α = 0.36) as the base case; the Tao et al. estimate is retained as the conservative scenario. The experiment used the 343 MIMIC-IV clinical notes with sufficient text content, out of the 511 annotated notes in the full DiReCT dataset. The minimum N* for whitening to cover its €800 implementation cost is 6 annual queries at the base-case parameters and 18 at the conservative α = 0.36; both thresholds are low compared with typical institutional deployment scales. Per 1,000 annual queries, whitening prevents 4.74 adverse diagnostic events (base case) or 1.53 (conservative), corresponding to €253,008 or €81,983, respectively, in healthcare savings over 5 years. These estimates depend on whether improvements in diagnosis-label retrieval accuracy translate into actual clinician diagnostic performance, a structural assumption the DiReCT experiment does not itself address.

Conclusions Within this framework, whitening appears economically plausible across the modelled cost structure. The DiReCT experiment provides an empirical α estimate in a clinical-note retrieval task with diagnosis-label relevance, substantially above the previously transported CDSS estimate (α = 0.36), which is retained as the conservative scenario. The remaining structural uncertainty is whether diagnosis-label retrieval translates into clinician diagnostic performance; resolving it would require a case-level linkage study with adequate causal identification.
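
As a rough illustration of the corpus-only ZCA whitening and the MRR metric named in the abstract, the Python sketch below fits a whitening transform on corpus embeddings alone and applies it to both corpus and query vectors. It is not the preprint's implementation: the function names (fit_zca, apply_zca, mean_reciprocal_rank), the eps regulariser, the cosine-similarity ranking, and the synthetic embeddings are all assumptions made for illustration.

```python
import numpy as np


def fit_zca(corpus_emb, eps=1e-5):
    """Estimate the mean and ZCA whitening matrix from corpus embeddings only.

    eps is a hypothetical regulariser that keeps near-zero eigenvalues
    from blowing up the inverse square root.
    """
    mean = corpus_emb.mean(axis=0)
    centered = corpus_emb - mean
    cov = centered.T @ centered / (corpus_emb.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov is symmetric
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negatives
    # ZCA: rotate to the eigenbasis, rescale, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W


def apply_zca(emb, mean, W):
    """Apply a previously fitted transform to corpus or query embeddings."""
    return (emb - mean) @ W


def mean_reciprocal_rank(query_emb, doc_emb, relevant_idx):
    """MRR under cosine similarity, one relevant document per query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T
    rel = sims[np.arange(len(q)), relevant_idx]
    # Rank = 1 + number of documents scoring strictly higher than the relevant one.
    ranks = 1 + (sims > rel[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())


# Synthetic, anisotropic "embeddings" (illustrative only, not clinical data).
rng = np.random.default_rng(0)
scales = np.linspace(0.5, 5.0, 16)
corpus = rng.normal(size=(500, 16)) * scales + 3.0
queries = corpus[:20] + rng.normal(scale=0.3, size=(20, 16))
relevant = np.arange(20)

mean, W = fit_zca(corpus)            # fitted on the corpus only
whitened = apply_zca(corpus, mean, W)
whitened_q = apply_zca(queries, mean, W)

mrr_raw = mean_reciprocal_rank(queries, corpus, relevant)
mrr_white = mean_reciprocal_rank(whitened_q, whitened, relevant)
print(f"MRR raw: {mrr_raw:.3f}  MRR whitened: {mrr_white:.3f}")
```

The difference mrr_white − mrr_raw plays the role of ΔMRR in this toy setting. Its sign and size on synthetic data say nothing about the values reported above, which come from ClinicalBERT embeddings of DiReCT notes.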

Version published to 10.21203/rs.3.rs-9237671/v1 on Research Square
Mar 30, 2026

Related articles

Representation changes across varying clinical input conditions: A dual-metric validation study of eight transformer architectures with length controls
This article has 1 author:
1. Yngve Mikkelsen
This article has no evaluations. Latest version: Mar 30, 2026

Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in MIMIC-IV
This article has 2 authors:
1. Amir Rouhollahi
2. Farhad R. Nezami
This article has no evaluations. Latest version: May 11, 2026

Evidence-Graded Decision Authorization for Safe Clinical AI: A Constrained Reasoning Framework
This article has 3 authors:
1. Che Lin
2. Jia-Yi Lin
3. Yao-San Lin
This article has no evaluations. Latest version: May 22, 2026
