End-to-End Multi-Modal Speaker Change Detection with Pre-trained Models

Abstract

In this work, we propose a multi-modal Speaker Change Detection (SCD) approach with focal loss that integrates both audio and text features to enhance detection performance. The proposed approach utilizes pre-trained large-scale models for feature extraction and incorporates a self-attention mechanism to emphasize the features most relevant to speaker change. The extracted features are fused and processed through a fully connected classification network, with layer normalization and dropout for stability and generalization. To address class imbalance, we apply focal loss, which emphasizes difficult samples and leads to more balanced performance. Extensive experiments on a multi-talker meeting dataset demonstrate that the proposed multi-modal approach consistently outperforms single-modal models, confirming the complementary nature of audio and text for SCD. Fine-tuning the pre-trained audio and text models (Wav2Vec2 and BERT) significantly boosts accuracy, achieving a 21% improvement over frozen models. The self-attention mechanism further improves performance by 2%, highlighting its ability to capture speaker transition cues effectively. Additionally, focal loss enhances the model's robustness to imbalanced data.
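The focal loss the abstract refers to is the standard formulation of Lin et al.; as a rough illustration of how it handles class imbalance, here is a minimal sketch of the common binary form, assuming a focusing parameter `gamma` and positive-class weight `alpha` (the paper's exact hyperparameters are not stated in the abstract):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p     -- predicted probability of the positive (speaker-change) class
    y     -- ground-truth label, 1 at a change point, 0 otherwise
    gamma -- focusing parameter; gamma=0 recovers alpha-weighted cross-entropy
    alpha -- weight on the (rare) positive class
    """
    # p_t is the probability the model assigned to the true class
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights easy, well-classified samples,
    # so training focuses on the hard (often minority-class) ones
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct negative contributes far less loss
# than a confidently wrong positive
easy_negative = focal_loss(0.1, 0)
hard_positive = focal_loss(0.1, 1)
print(easy_negative < hard_positive)  # True
```

With `gamma=0` the expression reduces to ordinary alpha-weighted cross-entropy, which is one way to verify an implementation.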
