Multimodal Transformer for Heart Disease Classification Using Multiple Heart Sound Spectral Analyses and Clinical Metadata
Abstract
This study proposes a multimodal fusion framework for heart disease classification that combines multiple spectral analyses of heart sounds with clinical metadata. The framework integrates four complementary spectral analysis methods (Stockwell transform, bispectrum, mel-spectrogram, and power spectrum) using Vision Transformers. This spectral fusion is further enhanced by incorporating clinical metadata processed through Bio_ClinicalBERT, enabling the model to capture diagnostic insights recorded by specialist physicians. The model that fused both spectral features and clinical data achieved the best performance, with an accuracy of 0.881, outperforming the individual spectral models (accuracy 0.831) by 6%. Incorporating clinical metadata also yielded a 2.6-percentage-point gain in accuracy over the model that fused only the four spectral features (accuracy 0.855). Through SHAP analysis, we found that the model excels at detecting right-heart abnormalities, which are often difficult to identify through traditional auscultation, and we identified the Stockwell transform and mel-spectrogram as particularly influential features. The Stockwell transform's joint time-frequency localization allowed the model to capture transient patterns crucial for detecting subtle heart sound abnormalities, while the mel-spectrogram, designed to mimic human auditory perception, excelled at highlighting the frequency-related features commonly recognized by clinicians.
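To make two of the four spectral views concrete, the sketch below computes a mel-spectrogram and a power spectrum from a heart-sound signal with librosa. The sampling rate, FFT size, hop length, and mel-band count are illustrative assumptions rather than the paper's settings, and the Stockwell transform and bispectrum are omitted because they require dedicated implementations not described in the abstract.

```python
import numpy as np
import librosa


def spectral_views(signal: np.ndarray, sr: int = 2000):
    """Compute two of the four spectral views (mel-spectrogram and
    power spectrum). Parameter values here are assumptions chosen
    for illustration, not the paper's configuration."""
    n_fft, hop = 256, 64
    # Mel-spectrogram: a mel-scaled filter bank that mimics human
    # auditory frequency resolution.
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=64)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Power spectrum: squared magnitude of the short-time Fourier
    # transform.
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop)
    power = np.abs(stft) ** 2
    return mel_db, power
```

In such a pipeline, each 2-D spectral array would then be rendered or resized into an image tensor suitable for a Vision Transformer branch.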
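The abstract does not specify the fusion architecture in detail, so the following is only a minimal sketch of one plausible arrangement, assuming four pretrained ViT branches (one per spectral image) and a Bio_ClinicalBERT text branch fused by concatenating their [CLS] embeddings into a linear classifier. The checkpoint names (google/vit-base-patch16-224-in21k, emilyalsentzer/Bio_ClinicalBERT), hidden sizes, and concatenation head are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
from transformers import ViTModel, AutoModel


class MultimodalHeartSoundClassifier(nn.Module):
    """Sketch of a four-branch ViT + Bio_ClinicalBERT fusion model.
    Fusion by [CLS] concatenation and a single linear head are
    illustrative assumptions."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # One ViT encoder per spectral view: Stockwell transform,
        # bispectrum, mel-spectrogram, power spectrum.
        self.vits = nn.ModuleList([
            ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
            for _ in range(4)
        ])
        # Clinical-metadata text encoder.
        self.text_encoder = AutoModel.from_pretrained(
            "emilyalsentzer/Bio_ClinicalBERT")
        fused_dim = (4 * self.vits[0].config.hidden_size
                     + self.text_encoder.config.hidden_size)
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, spectral_images, input_ids, attention_mask):
        # spectral_images: list of four (B, 3, 224, 224) tensors,
        # one per spectral view.
        cls_tokens = [vit(pixel_values=img).last_hidden_state[:, 0]
                      for vit, img in zip(self.vits, spectral_images)]
        # [CLS] embedding of the tokenized clinical metadata.
        text_cls = self.text_encoder(
            input_ids=input_ids,
            attention_mask=attention_mask).last_hidden_state[:, 0]
        fused = torch.cat(cls_tokens + [text_cls], dim=-1)
        return self.head(fused)
```

The clinical-metadata text would be tokenized with the tokenizer matching the Bio_ClinicalBERT checkpoint before being passed as input_ids and attention_mask.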