Echo-Vision-FM: A Pre-training and Fine-tuning Framework for Echocardiogram Videos Vision Foundation Model
Abstract
Background
Echocardiograms provide vital insights into cardiac health, but their complex, multi-dimensional data presents challenges for analysis and interpretation. Current deep learning models for echocardiogram analysis often rely on supervised training, limiting their generalizability and robustness across datasets and clinical environments.
Objective
To develop and evaluate EchoVisionFM (Echocardiogram video Vision Foundation Model), a self-supervised video learning framework designed to pre-train a video encoder on large-scale, unlabeled echocardiogram data. EchoVisionFM aims to produce robust and transferable spatiotemporal representations, improving downstream performance across diverse echocardiogram datasets and clinical conditions.
Methods
Our framework employs Echo-VideoMAE, an autoencoder-based video transformer that compresses and reconstructs echocardiogram video data by masking non-overlapping video patches and leveraging a ViT encoder-decoder structure. For enhanced representation, we introduce STFF-Net, a SpatioTemporal Feature Fusion Network, to integrate spatial and temporal features from the manifold representations. We pre-trained EchoVisionFM using the MIMIC-IV-ECHO dataset and fine-tuned it on the EchoNet-Dynamic dataset for downstream tasks, including classification and regression of key cardiac parameters.
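The masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the tube-style masking (one spatial pattern repeated over time), the 90% mask ratio, the 2 x 16 x 16 patch size, the 16-frame 112 x 112 clip, and the helper names `patchify_video` and `tube_mask` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def patchify_video(video, tubelet=2, patch=16):
    """Split a (T, H, W) clip into non-overlapping (tubelet, patch, patch) tokens."""
    T, H, W = video.shape
    assert T % tubelet == 0 and H % patch == 0 and W % patch == 0
    t, h, w = T // tubelet, H // patch, W // patch
    x = video.reshape(t, tubelet, h, patch, w, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # -> (t, h, w, tubelet, patch, patch)
    return x.reshape(t * h * w, tubelet * patch * patch)

def tube_mask(t, h, w, mask_ratio=0.9, rng=None):
    """Boolean mask over t*h*w tokens (True = masked).

    Assumption: the same spatial positions are masked at every time step.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_spatial = h * w
    n_masked = int(round(mask_ratio * n_spatial))
    spatial = np.zeros(n_spatial, dtype=bool)
    spatial[rng.choice(n_spatial, size=n_masked, replace=False)] = True
    return np.tile(spatial, t)  # token order is time-major, matching patchify_video

# Synthetic stand-in for a grayscale echo clip (hypothetical shape).
video = np.random.default_rng(0).random((16, 112, 112)).astype(np.float32)
tokens = patchify_video(video)
mask = tube_mask(8, 7, 7, mask_ratio=0.9, rng=np.random.default_rng(0))
visible = tokens[~mask]  # only these tokens would be passed to the encoder
```

In a masked-autoencoder setup of this kind, the encoder sees only the visible tokens and the decoder is trained to reconstruct the masked ones, which is what makes pre-training on unlabeled video possible.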
Results
EchoVisionFM demonstrated superior performance in classifying left ventricular ejection fraction (LVEF), achieving an accuracy of 89.12%, an F1 score of 0.9323, and an AUC of 0.9364. In regression tasks, EchoVisionFM outperformed state-of-the-art models, with LVEF prediction reaching a mean absolute error (MAE) of 4.18% and an R² of 0.8022. The model also showed significant improvements in estimating end-systolic and end-diastolic volumes, with R² values of 0.8006 and 0.7296, respectively. Incorporating STFF-Net led to further performance gains across tasks.
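For readers unfamiliar with the regression metrics, the following sketch shows the standard definitions of MAE and the coefficient of determination R². The sample values are made up for illustration and are not data from the study; the function names are likewise only illustrative.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, using the mean of y_true as the baseline."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical LVEF values in percent (illustrative only).
lvef_true = [55.0, 60.0, 35.0, 70.0]
lvef_pred = [50.0, 62.0, 40.0, 66.0]

mae = mean_absolute_error(lvef_true, lvef_pred)
r2 = r2_score(lvef_true, lvef_pred)
print(f"MAE = {mae:.2f} percentage points, R^2 = {r2:.4f}")
```

Because LVEF is itself a percentage, an MAE computed this way is expressed in ejection-fraction percentage points.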
Conclusion
Our results indicate that large-scale self-supervised pre-training on echocardiogram videos enables the extraction of transferable and clinically relevant features, outperforming traditional CNN-based methods. The EchoVisionFM framework, particularly with STFF-Net, enhances the extraction of spatiotemporal features, improving the predictive accuracy for various cardiac parameters. EchoVisionFM offers a powerful, scalable approach for echocardiogram analysis, with potential applications in clinical diagnostics and research.