Clinical Note Comparison and Data Retrieval Via Embedding Vectors: Model Selection, Metrics, and Convergence

Alexandra Dahlberg
Olli Tapiola
Rami Luisto
Tuukka Puranen
Enni Sanmark
Ville Vartiainen

Abstract

Background

Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records.

Methods

Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, L¹ distance, and Euclidean distance were computed between the embedding vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. The two models that differed most in semantic difference detection were then evaluated on retrieval of relevant information from real Finnish vascular surgery records.
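
As a rough illustration of the three comparison metrics named above, the following sketch computes cosine similarity, L¹ distance, and Euclidean distance between two vectors using only the Python standard library. The toy vectors and the `partial_vector` helper are assumptions made for illustration: the abstract does not state how partial vectors were selected, so taking the first k coordinates is only one possible reading, not the authors' method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def partial_vector(v, k):
    """Hypothetical helper: keep only the first k coordinates (an assumption)."""
    return v[:k]


# Toy 4-dimensional vectors standing in for real embeddings (made-up values).
original = [0.6, 0.8, 0.0, 0.0]
perturbed = [0.8, 0.6, 0.0, 0.0]

full_cosine = cosine_similarity(original, perturbed)
full_l1 = l1_distance(original, perturbed)
full_euclidean = euclidean_distance(original, perturbed)

# The same comparison restricted to the first two coordinates.
partial_cosine = cosine_similarity(
    partial_vector(original, 2), partial_vector(perturbed, 2)
)
```

In this toy case the last two coordinates are zero in both vectors, so the partial comparison gives the same cosine similarity as the full one. Real embeddings carry no such guarantee, which is why the comparison has to be made empirically.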

Results

Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6-1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding.
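
The abstract does not define a directional error. One plausible reading, assumed here purely for illustration, is a case in which a content change (deletion or modification) fails to move the vector farther from the original than a paraphrase of the same text does. Under that assumption, an error rate could be sketched as follows; the function name and the distance values are hypothetical and are not taken from the study.

```python
def directional_error_rate(cases):
    """Fraction of cases where the content-change distance does not exceed
    the paraphrase distance.

    Each case is a (paraphrase_distance, content_change_distance) pair.
    This definition is an assumption, not the one used in the study.
    """
    if not cases:
        raise ValueError("at least one case is required")
    errors = sum(1 for paraphrase, change in cases if change <= paraphrase)
    return errors / len(cases)


# Made-up distances for four texts: (paraphrase, content change).
example_cases = [(0.05, 0.20), (0.10, 0.08), (0.03, 0.15), (0.07, 0.30)]

# Only the second case has a content-change distance below its paraphrase.
rate = directional_error_rate(example_cases)
```

A model with no directional errors would return 0.0 on every such set of cases.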

Conclusions

Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.

Version published to 10.64898/2026.05.12.26352832 on medRxiv
May 18, 2026

