Representation changes across varying clinical input conditions: A dual-metric validation study of eight transformer architectures with length controls

Yngve Mikkelsen

Abstract

Background: Large language models are increasingly deployed in clinical decision support, yet the stability of their internal representations across diverse clinical input conditions remains poorly characterised. It is unclear whether changes in representation reflect geometric reorganisation (magnitude and directional shifts) or simple scaling artefacts.
Methods: We used a four-group validation design across eight transformer models (2018–2023), including two modern architectures (Llama-2-7B, Mistral-7B). Group 1 comprised 450 MT samples of clinical notes, stratified into simple/moderate/complex length-proxied strata (n = 150 each). Group 2 comprised 450 matched synthetic texts. Groups 3–4 comprised 600 length-controlled texts isolating pure length effects. From the final hidden layer (excluding special tokens), we computed (i) per-token embedding magnitude (mean L2 norm) and (ii) mean pairwise cosine similarity between token embeddings. Analyses used one-way ANOVA with Bonferroni correction across eight models (α = 0.00625), Welch t-tests, and bootstrap 95% confidence intervals (10,000 iterations).
Results: In Group 1, seven of eight models showed significant differences in magnitude across strata (p < 0.00625). Six of these seven also showed significant directional changes (cosine similarity changes of 3–26%), indicating geometric changes rather than scaling alone. BioBERT and ClinicalBERT showed the largest dual-metric effects (magnitude +8.5% and +7.3%; cosine −25.9% and −25.0%). Llama-2-7B showed no significant magnitude change (−0.6%, p = 0.062) and a non-significant simple-to-complex cosine change (+3.6%, p = 0.126). Mistral-7B showed a small but significant magnitude increase (+1.9%, p < 0.001) and significant directional convergence (cosine +14.4%, p < 0.001). Length-controlled analyses confirmed substantial length effects on both metrics.
Conclusions: In older models, representation changes across length-proxied strata of clinical complexity are predominantly geometric. Modern architectures exhibit smaller magnitude shifts and a convergence trend in cosine similarity, in contrast to directional divergence in older models. Whether these representation-level changes translate into differences in downstream clinical task performance remains to be established.
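
The two metrics and the resampling step named in the Methods can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the input layout (one array of shape tokens × hidden size per text, with special tokens already removed), the percentile form of the bootstrap, and the family-wise level of 0.05 behind the reported α are all assumptions made for illustration.

```python
import numpy as np

# Assumed: family-wise level 0.05 split across the eight models,
# which reproduces the alpha = 0.00625 reported in the abstract.
BONFERRONI_ALPHA = 0.05 / 8


def mean_l2_norm(hidden: np.ndarray) -> float:
    """Mean L2 norm of the token embeddings; hidden has shape (tokens, dim)."""
    return float(np.linalg.norm(hidden, axis=1).mean())


def mean_pairwise_cosine(hidden: np.ndarray) -> float:
    """Mean cosine similarity over all distinct token pairs (i < j)."""
    n_tokens = hidden.shape[0]
    if n_tokens < 2:
        raise ValueError("need at least two tokens for a pairwise metric")
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    unit = hidden / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T
    upper = np.triu_indices(n_tokens, k=1)  # skip the diagonal and duplicates
    return float(sims[upper].mean())


def bootstrap_ci(values, n_boot: int = 10_000, level: float = 0.95, seed: int = 0):
    """Percentile bootstrap interval for the mean of per-text metric values."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample texts with replacement, n_boot times, and average each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(boot_means, [tail, 1.0 - tail])
    return float(lower), float(upper)
```

Multiplying every embedding by a constant changes `mean_l2_norm` but leaves `mean_pairwise_cosine` untouched, which is why a shift in both metrics points to a geometric change rather than a scaling artefact.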

Version published to 10.21203/rs.3.rs-9237602/v1 on Research Square
Mar 30, 2026

Related articles

Early economic evaluation of retrieval-layer correction in clinical RAG: a decision-uncertainty framework

This article has 1 author:
1. Yngve Mikkelsen
This article has no evaluations. Latest version: Mar 30, 2026

Benchmarking General-Purpose and Medical AI Large Language Models for Clinical Assessment and Management in Parkinson’s Disease

This article has 5 authors:
1. Shechter Yosef
2. Klevor Raymond
3. Kouchache Trycia
4. Bouhadoun Sarah
5. Ronald B Postuma
This article has no evaluations. Latest version: May 20, 2026

Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in MIMIC-IV

This article has 2 authors:
1. Amir Rouhollahi
2. Farhad R. Nezami
This article has no evaluations. Latest version: May 11, 2026
