Echo-Vision-FM: A Pre-training and Fine-tuning Framework for Echocardiogram Video Vision Foundation Model
Abstract
Background
Echocardiograms provide essential insights into cardiac health, yet their complex, multidimensional data poses significant challenges for analysis and interpretation. Existing deep learning models for echocardiogram analysis often rely heavily on supervised training, which limits their generalizability and robustness across different datasets and clinical environments.
Objective
To develop and evaluate Echo-Vision-FM (Echocardiogram video Vision Foundation Model), a self-supervised video learning framework designed to pre-train a video encoder on large-scale, unlabeled echocardiogram data. Echo-Vision-FM aims to produce robust and transferable video representations, improving downstream performance across diverse echocardiogram datasets and clinical conditions.
Methods
The proposed framework employs self-supervised video learning through a masked auto-encoding technique: non-overlapping spatio-temporal video patches are masked at a high ratio, and the model reconstructs the full video from the remaining visible patches. An asymmetric encoder-decoder architecture underpins this approach, with the encoder operating only on visible patches. To further enhance the learned representations, we introduce STF-Net, a Spatial-Temporal Fusion Net, which integrates spatial and temporal correlations from the video representations. We pre-trained Echo-Vision-FM on the MIMIC-IV-ECHO dataset and fine-tuned it across multiple downstream datasets for specific clinical tasks, including morphological value estimation and the diagnosis of heart function and diseases.
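The masking step described above can be sketched in a few lines. The following is a minimal illustration of tubelet patchification and high-ratio random masking for a video masked auto-encoder; the patch size, mask ratio, clip dimensions, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def patchify(video, patch=(2, 16, 16)):
    """Split a (T, H, W) grayscale echo clip into non-overlapping
    spatio-temporal patches (tubelets). Returns (num_patches, patch_dim)."""
    T, H, W = video.shape
    pt, ph, pw = patch  # illustrative tubelet size, not from the paper
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    v = v.transpose(0, 2, 4, 1, 3, 5)  # gather each tubelet's voxels together
    return v.reshape(-1, pt * ph * pw)

def random_mask(num_patches, ratio=0.9, rng=None):
    """Randomly split patch indices into visible and masked sets.
    A high ratio (e.g. 0.9) makes reconstruction a hard pretext task."""
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(num_patches)
    n_visible = int(round(num_patches * (1 - ratio)))
    return np.sort(perm[:n_visible]), np.sort(perm[n_visible:])

# Toy clip: 16 frames of 112x112 (dimensions chosen for illustration).
clip = np.random.rand(16, 112, 112).astype(np.float32)
patches = patchify(clip)                       # (8 * 7 * 7, 2*16*16) = (392, 512)
visible_idx, masked_idx = random_mask(len(patches))
encoder_input = patches[visible_idx]           # asymmetry: encoder sees only ~10%
```

The asymmetric design means the encoder processes only the small visible subset, while a lightweight decoder receives encoder outputs plus mask tokens and reconstructs the masked patches, which keeps pre-training cheap at high mask ratios.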
Results
Echo-Vision-FM achieved superior performance in classifying left ventricular ejection fraction (LVEF), with an accuracy of 0.905, an F1 score of 0.941, and an AUC of 0.931. In regression tasks, Echo-Vision-FM outperformed state-of-the-art models, achieving a mean absolute error (MAE) of 3.87% and an r² of 0.825 for LVEF prediction. The model also demonstrated significant improvements in estimating end-systolic and end-diastolic volumes, with r² values of 0.782 and 0.742, respectively. Incorporating STF-Net further enhanced performance across all tasks.
Conclusion
Our results demonstrate that large-scale self-supervised video learning on echocardiogram data enables the extraction of transferable and clinically relevant features, surpassing existing methods. The Echo-Vision-FM framework, particularly with the inclusion of STF-Net, significantly improves the extraction of spatiotemporal features, resulting in enhanced predictive accuracy for a range of cardiac parameters. Echo-Vision-FM offers a scalable and effective solution for echocardiogram analysis, with promising applications in clinical diagnostics and research.