CardioFM: A Multimodal Foundation Model for Joint ECG and PPG Representation Learning
Abstract
Electrocardiography (ECG) and photoplethysmography (PPG) arise from the same heartbeat and are routinely co-acquired at every monitored bedside, yet no foundation model jointly encodes both modalities. Existing approaches are either ECG-specific, PPG-specific, or domain-agnostic, and none captures the cross-modal physiological coupling between cardiac electrical activity and peripheral hemodynamics. We present CardioFM, a self-supervised multimodal foundation model that integrates ECG Lead-II and PPG through bidirectional cross-modal attention and adaptive residual vector quantization. CardioFM is pretrained on over 500,000 hours from approximately 63,000 patients across intensive care, surgical, ambulatory, and consumer-wearable settings, learning unified representations that transfer across contexts without retraining. CardioFM achieves an F1-score of 0.86 for cardiovascular disease classification on PTB-XL, estimates the QT interval with a mean error of 20.2 ms approaching expert inter-observer variability, and measures pulse arrival time with a mean error of 22.7 ms sufficient to support non-invasive hemodynamic trending. When used as a feature extractor, CardioFM embeddings provide superior discrimination for intensive care false alarm reduction compared with ECG-FM, PaPaGei, and TimesFM, despite requiring substantially smaller representations. In contrast, generic temporal pretraining fails to encode clinically relevant waveform morphology. Demographic inference from waveform embeddings (age MAE: 10.4 years; gender AUC: 0.97; BMI MAE: 0.66 kg/m²) confirms that the learned representations encode fundamental biological characteristics without requiring diagnostic labels. The model maintains zero-shot reconstruction fidelity across five independent datasets spanning heterogeneous sensor hardware, sampling rates, and patient populations, with the cross-modal attention mechanism providing robustness to single-modality signal degradation.
The 17.11-million-parameter encoder is compatible with edge-deployment constraints, and the model uses only signal modalities already acquired by standard bedside monitors and consumer wearables, requiring no additional sensing hardware. These findings demonstrate that a single multimodal foundation model can consolidate the fragmented landscape of cardiac biosignal analysis, providing a unified representational framework across clinical monitoring systems and wearable health technologies that may extend to broader critical illness surveillance.
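The abstract names residual vector quantization as one of the model's components but does not describe it. As a rough illustration of the general idea only, the sketch below shows plain (non-adaptive) residual vector quantization in pure Python: each stage picks the nearest codeword to whatever the previous stages left unexplained. Everything here is an assumption made for illustration and is not taken from CardioFM: the function names, the toy dimensions, the random codebooks, and the zero codeword added to each stage so that an extra stage can never increase the error.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def dist2(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the residual of the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy setup (not CardioFM's configuration).
rng = random.Random(0)
DIM, STAGES, CODEBOOK_SIZE = 4, 3, 8
codebooks = []
for stage in range(STAGES):
    scale = 0.5 ** stage  # later stages model finer residuals
    codebook = [[rng.gauss(0, scale) for _ in range(DIM)]
                for _ in range(CODEBOOK_SIZE - 1)]
    codebook.append([0.0] * DIM)  # zero codeword: a stage can always "do nothing"
    codebooks.append(codebook)

embedding = [rng.gauss(0, 1) for _ in range(DIM)]
codes, residual = rvq_encode(embedding, codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

In this toy version the embedding is stored as one small integer per stage, and the reconstruction plus the final residual recovers the original vector exactly (up to floating-point error). In a trained model the codebooks would be learned rather than sampled at random.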