Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in MIMIC-IV
Abstract
Objective
How structured clinical features and cluster-semantic embeddings interact under self-distillation regularisation in EHR prediction models has not been characterised. Existing approaches treat these sources separately: gradient-boosted trees exploit tabular features, while sequence models process text. We introduce the Narrative Velocity (NV) framework and evaluate this interaction in a 7-model benchmark.
Materials and Methods
Cadence is a ∼5.86M-parameter residual multilayer perceptron (MLP) combining structured EHR features with frozen PubMedBERT embeddings of cluster-label strings under born-again self-distillation from a prior Cadence checkpoint (seed-42 teacher; [1]). Cadence is benchmarked against six comparators on MIMIC-IV v3.1 with dual-sex TRIPOD+AI reporting (5 student seeds for Cadence; 2–3 seeds for baselines).
Results
At full-cohort scale, Cadence achieves 38.04 ± 0.04% male and 35.66 ± 0.04% female top-1 accuracy, exceeding the strongest non-neural baseline (XGBoost-2420, trained on the identical 2,420-dimensional input) by +1.35 pp male and +0.82 pp female (paired t-test on shared seeds 42–44: t(2) = 69.06, p = 2.10 × 10⁻⁴ male; t(2) = 25.32, p = 1.56 × 10⁻³ female). On time-to-next-event regression, Cadence lowers MAE by 7.68 d male and 7.30 d female versus XGBoost-2420; FT-Transformer attains the lowest absolute MAE at full scale (27.58 d male, 36.63 d female), revealing a classification-regression trade-off across model families. A controlled 2 × 2 random-vector ablation isolates the self-distillation–embedding interaction at +0.49 pp top-1 (95% CI [0.35, 0.64] pp; bootstrap, n = 10,000 resamples) under a matched-dimensionality null. A 3-teacher-seed validation (multi_teacher_02) confirms that the interaction is robust to teacher-seed identity (per-seed values +0.525, +0.509, +0.507 pp; mean +0.513 ± 0.010 pp). Cadence achieves the best Brier score among evaluated models (0.774 male / 0.798 female), but its raw probabilities are systematically miscalibrated (ECE 0.077 vs. XGBoost-884’s 0.010); after a single scalar temperature scaling step (T* ≈ 0.81), ECE drops to ≈0.028 while Brier remains best. On a small (n = 1,120 patients, 39,120 events) external OCR-extracted BWH cohort, Cadence ranked 3rd of 7 models with three confounded sources of error (institutional shift, OCR noise, centroid mapping); we therefore report this as a generalisation probe rather than a definitive external validation. At the longer h30 evaluation horizon, Cadence’s MAE advantage reverses (47.35 d versus XGBoost 45.06 d), reflecting the absence of a matched-horizon self-distillation teacher.
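The paired t-test figures above can be sanity-checked without a statistics library, because the Student t distribution with 2 degrees of freedom has a closed-form CDF. The sketch below is illustrative only and is not the authors' analysis code: the function names are hypothetical, and the per-seed accuracies behind the reported t statistics are not given in this abstract, so only the t-to-p conversion is checked against the reported values.

```python
import math


def paired_t_statistic(a, b):
    """Paired t statistic for two equal-length lists of per-seed scores."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the paired differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


def two_sided_p_df2(t):
    """Two-sided p-value for a t statistic with exactly 2 degrees of freedom.

    For df = 2 the CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)), so
    p = 1 - |t| / sqrt(t^2 + 2). The algebraically equivalent form
    2 / (s * (s + |t|)) with s = sqrt(t^2 + 2) avoids cancellation for large |t|.
    """
    a = abs(t)
    s = math.sqrt(a * a + 2.0)
    return 2.0 / (s * (s + a))


# Reported statistics from the abstract (three shared seeds, hence df = 2).
p_male = two_sided_p_df2(69.06)
p_female = two_sided_p_df2(25.32)
print(f"male p = {p_male:.3e}, female p = {p_female:.3e}")
```

With three shared seeds the test has only 2 degrees of freedom, so very large t statistics are needed to reach small p-values; the closed form makes that relationship easy to inspect.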
Discussion
The 2 × 2 random-vector ablation confirms that the self-distillation gain on PubMedBERT embeddings (+0.78 pp) exceeds that on matched-dimensionality random vectors (+0.29 pp) by +0.49 pp, isolating the interaction to semantic content rather than feature dimensionality. The factorial decomposition (+0.49–0.51 pp interaction) and the sequential pipeline-level decomposition (Supplementary Table S3) are complementary triangulations under different reference frames and are not directly additive.
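The interaction term in a 2 × 2 factorial design is a difference of differences: the self-distillation gain with semantic embeddings minus the gain with matched-dimensionality random vectors. The following minimal sketch reproduces that arithmetic from the two gains reported above. The function and argument names are hypothetical, and the four underlying cell accuracies are not stated in this abstract, so the cell-level function is shown only as an assumed general form.

```python
def interaction_from_cells(sd_semantic, base_semantic, sd_random, base_random):
    """Difference-in-differences interaction from four cell means (in pp or %).

    Each argument is the top-1 accuracy of one cell of the 2 x 2 design:
    {self-distillation, no self-distillation} x {semantic, random vectors}.
    """
    gain_semantic = sd_semantic - base_semantic
    gain_random = sd_random - base_random
    return gain_semantic - gain_random


def interaction_from_gains(gain_semantic, gain_random):
    """Same quantity when only the two self-distillation gains are known."""
    return gain_semantic - gain_random


# Gains reported in the Discussion: +0.78 pp (PubMedBERT), +0.29 pp (random).
interaction_pp = interaction_from_gains(0.78, 0.29)
print(f"interaction = {interaction_pp:+.2f} pp")
```

Because the random-vector arm has the same dimensionality as the embedding arm, any gain attributable purely to added input width cancels in the subtraction, leaving the part associated with semantic content.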
Conclusion
This 7-model benchmark establishes a dual-sex, dual-metric, cross-institutional reference for next clinical event prediction under the TRIPOD+AI reporting framework. These results characterise discrimination and calibration on a single retrospective cohort; prospective evaluation, decision-curve analysis, and harm-benefit assessment are required before clinical deployment.