FedEmoNet: Privacy-Preserving Federated Learning with TCN-Transformer Fusion for Cross-Corpus Speech Emotion Recognition


Abstract

Cross-corpus speech emotion recognition faces significant challenges due to domain shifts and privacy concerns, with existing systems showing 20–40% performance degradation across datasets while requiring centralized data collection. This paper presents a privacy-preserving federated learning framework integrating FedProx-based distributed training with a hybrid TCN-Transformer architecture, PSO-optimized feature selection, and formal differential privacy guarantees. The federated protocol enables collaborative model training across five distributed clients under non-IID data distribution (Dirichlet α = 0.5) without sharing raw speech data. Within each client, the local model employs multi-scale phase space reconstruction at micro (25 ms), meso (250 ms), and macro (2.5 s) temporal scales, combined with spectral and handcrafted features processed through a TCN-Transformer fusion architecture. Formal (ϵ = 1.0, δ = 10⁻⁵)-differential privacy is achieved via gradient clipping and calibrated noise injection. Experiments follow a consistent 80/20 train-test split with subject-independent validation. The framework achieves 99.07% ± 0.35% accuracy on EmoDB and 98.96% ± 0.42% on RAVDESS, with cross-corpus evaluation on CREMA-D achieving 68.15% ± 1.23% without fine-tuning. Ablation studies quantify component contributions: PSO feature selection (+2.80%), Transformer blocks (+2.10%), and FedProx protocol (+2.62%). Privacy analysis demonstrates resistance to membership inference attacks, with attack AUC reduced to 0.52 while maintaining 98.5% accuracy under differential privacy constraints.
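The two core privacy and federation mechanisms named in the abstract, gradient clipping with calibrated Gaussian noise for differential privacy, and the FedProx proximal term that regularizes local updates toward the global model, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the plain-list gradient representation are assumptions for clarity.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale the gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def privatize(grad, clip_norm, sigma, rng):
    """DP-SGD-style update: clip, then add Gaussian noise whose
    standard deviation is calibrated to the clipping norm."""
    clipped = clip_gradient(grad, clip_norm)
    return [g + rng.gauss(0.0, sigma * clip_norm) for g in clipped]


def fedprox_grad(local_grad, w_local, w_global, mu):
    """Add the FedProx proximal gradient mu * (w_local - w_global),
    which pulls each client's update back toward the global model."""
    return [g + mu * (wl - wg)
            for g, wl, wg in zip(local_grad, w_local, w_global)]
```

In a full system, the noise multiplier sigma would be chosen by a privacy accountant to meet the stated (ϵ = 1.0, δ = 10⁻⁵) budget, and the proximal coefficient mu tuned against the non-IID severity of the client split.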
