FedEmoNet: Privacy-Preserving Federated Learning with TCN-Transformer Fusion for Cross-Corpus Speech Emotion Recognition
Abstract
Cross-corpus speech emotion recognition faces significant challenges due to domain shifts and privacy concerns, with existing systems showing 20–40% performance degradation across datasets while requiring centralized data collection. This paper presents a privacy-preserving federated learning framework integrating FedProx-based distributed training with a hybrid TCN-Transformer architecture, PSO-optimized feature selection, and formal differential privacy guarantees. The federated protocol enables collaborative model training across five distributed clients under non-IID data distribution (Dirichlet α = 0.5) without sharing raw speech data. Within each client, the local model employs multi-scale phase space reconstruction at micro (25 ms), meso (250 ms), and macro (2.5 s) temporal scales, combined with spectral and handcrafted features processed through a TCN-Transformer fusion architecture. Formal (ε = 1.0, δ = 10⁻⁵)-differential privacy is achieved via gradient clipping and calibrated noise injection. Experiments follow a consistent 80/20 train-test split with subject-independent validation. The framework achieves 99.07% ± 0.35% accuracy on EmoDB and 98.96% ± 0.42% on RAVDESS, with cross-corpus evaluation on CREMA-D achieving 68.15% ± 1.23% without fine-tuning. Ablation studies quantify component contributions: PSO feature selection (+2.80%), Transformer blocks (+2.10%), and FedProx protocol (+2.62%). Privacy analysis demonstrates membership inference attack resistance with AUC reduced to 0.52 while maintaining 98.5% accuracy under differential privacy constraints.
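The two core mechanisms named in the abstract, per-gradient clipping with calibrated Gaussian noise for differential privacy and the FedProx proximal term for non-IID robustness, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the learning rate, the proximal coefficient `mu`, and the clipping and noise parameters are hypothetical placeholders.

```python
import numpy as np

def clip_and_noise(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Gaussian mechanism: rescale each per-sample gradient to L2 norm
    <= clip_norm, average, then add noise calibrated to the clip bound."""
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_sample_grads]
    mean = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(clipped)
    return mean + rng.normal(0.0, sigma, size=mean.shape)

def fedprox_local_step(w, w_global, per_sample_grads, lr=0.1, mu=0.01,
                       clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One FedProx client update: the DP-sanitized gradient plus the
    proximal term mu * (w - w_global), which penalizes local models
    for drifting from the global model under non-IID client data."""
    rng = rng or np.random.default_rng(0)
    g = clip_and_noise(per_sample_grads, clip_norm, noise_multiplier, rng)
    return w - lr * (g + mu * (w - w_global))
```

With `mu = 0` this reduces to a plain DP-SGD step; the proximal term is what distinguishes FedProx from FedAvg on heterogeneous (Dirichlet-partitioned) client data.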