Representation changes across varying clinical input conditions: A dual-metric validation study of eight transformer architectures with length controls

Abstract

Background: Large language models are increasingly deployed in clinical decision support, yet the stability of their internal representations across diverse clinical input conditions remains poorly characterised. It is unclear whether changes in representation reflect geometric reorganisation (magnitude and directional shifts) or simple scaling artefacts.
Methods: We used a four-group validation design across eight transformer models (2018–2023), including two modern architectures (Llama-2-7B, Mistral-7B). Group 1 comprised 450 MT samples of clinical notes, stratified into simple/moderate/complex length-proxied strata (n = 150 each). Group 2 comprised 450 matched synthetic texts. Groups 3–4 comprised 600 length-controlled texts isolating pure length effects. From the final hidden layer (excluding special tokens), we computed (i) per-token embedding magnitude (mean L2 norm) and (ii) mean pairwise cosine similarity between token embeddings. Analyses used one-way ANOVA with Bonferroni correction across eight models (α = 0.00625), Welch t-tests, and bootstrap 95% confidence intervals (10,000 iterations).
Results: In Group 1, seven of eight models showed significant differences in magnitude across strata (p < 0.00625). Six of these seven also showed significant directional changes (cosine similarity changes of 3–26%), indicating geometric changes rather than scaling alone. BioBERT and ClinicalBERT showed the largest dual-metric effects (magnitude +8.5% and +7.3%; cosine −25.9% and −25.0%). Llama-2-7B showed no significant magnitude change (−0.6%, p = 0.062) and a non-significant simple-to-complex cosine change (+3.6%, p = 0.126). Mistral-7B showed a small but significant magnitude increase (+1.9%, p < 0.001) and significant directional convergence (cosine +14.4%, p < 0.001). Length-controlled analyses confirmed substantial length effects on both metrics.
Conclusions: In older models, representation changes across length-proxied strata of clinical complexity are predominantly geometric. Modern architectures exhibit smaller magnitude shifts and a convergence trend in cosine similarity, in contrast to directional divergence in older models. Whether these representation-level changes translate into differences in downstream clinical task performance remains to be established.
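
The two metrics named in the Methods can be illustrated with a minimal sketch. It is not the authors' code and rests on several assumptions: token embeddings arrive as a NumPy array of shape (n_tokens, hidden_dim) with special tokens already removed and no zero vectors, the bootstrap is a percentile bootstrap of the mean, and the Bonferroni threshold derives from a family-wise α of 0.05 (which reproduces the stated 0.00625 for eight models). All function names are hypothetical.

```python
import numpy as np


def mean_l2_norm(token_embeddings):
    """Mean L2 norm across tokens; input shape (n_tokens, hidden_dim)."""
    x = np.asarray(token_embeddings, dtype=float)
    return float(np.linalg.norm(x, axis=1).mean())


def mean_pairwise_cosine(token_embeddings):
    """Mean cosine similarity over all distinct token pairs.

    Assumes at least two tokens and no zero-norm embeddings.
    """
    x = np.asarray(token_embeddings, dtype=float)
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two token embeddings")
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(n, k=1)  # each unordered pair once, no self-pairs
    return float(sims[upper].mean())


def bootstrap_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant, see lead-in)."""
    rng = np.random.default_rng(seed)
    v = np.asarray(values, dtype=float)
    idx = rng.integers(0, v.size, size=(n_boot, v.size))
    means = v[idx].mean(axis=1)
    lo, hi = np.percentile(means, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return float(lo), float(hi)


# Bonferroni threshold across eight models, assuming family-wise alpha = 0.05.
BONFERRONI_ALPHA = 0.05 / 8
```

Multiplying every embedding by a constant changes the mean L2 norm but leaves the mean pairwise cosine similarity untouched, which is why a shift in both metrics points to more than uniform scaling.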
