End-to-End Multi-Modal Speaker Change Detection with Pre-trained Models
Abstract
In this work, we propose a multi-modal Speaker Change Detection (SCD) approach with focal loss that integrates both audio and text features to enhance detection performance. The proposed approach uses pre-trained large-scale models for feature extraction and incorporates a self-attention mechanism to emphasize features relevant to speaker change. The extracted features are fused and processed by a fully connected classification network, with layer normalization and dropout for stability and generalization. To address class imbalance, we apply focal loss, which emphasizes difficult samples and yields better-balanced performance. Extensive experiments on a multi-talker meeting dataset demonstrate that the proposed multi-modal approach consistently outperforms single-modal models, confirming the complementary nature of audio and text for SCD. Fine-tuning the pre-trained models (Wav2Vec2 and BERT) for audio and text significantly boosts accuracy, achieving a 21\% improvement over frozen models. The self-attention mechanism further improves performance by 2\%, highlighting its ability to capture speaker-transition cues effectively. Additionally, focal loss enhances the model's performance, making it more robust to imbalanced data.
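The focal loss mentioned above can be sketched as follows for the binary change/no-change case. This is a minimal stdlib-only illustration of the standard formulation (Lin et al.); the `gamma` and `alpha` defaults are the commonly used values, not parameters reported by this paper.

```python
import math

def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Per-sample binary focal loss.

    prob:   predicted probability of a speaker change, in (0, 1)
    target: ground-truth label, 1 = change, 0 = no change
    gamma:  focusing parameter; larger values down-weight easy samples more
    alpha:  class-balance weight for the positive (change) class

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p = min(max(prob, 1e-7), 1.0 - 1e-7)       # clamp for numerical stability
    p_t = p if target == 1 else 1.0 - p        # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified
    # samples, so rare speaker-change frames are not drowned out.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma = 0` the modulating factor vanishes and the expression reduces to class-weighted cross-entropy; increasing `gamma` progressively focuses training on hard, misclassified samples, which is what makes it attractive for the imbalanced change/no-change distribution described here.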