RAE-NeRF: Residual-Based Audio-Video Encoder with Denoising in Talking Head Synchronization
Abstract
In recent years, speech-driven facial synthesis has attracted significant attention owing to its wide applications in virtual humans, remote conferencing, and digital human generation. However, existing methods remain limited in realism, synchronization, and robustness, primarily because of noise interference in speech signals and insufficient precision in audio-visual feature fusion. To address these challenges, this paper proposes an enhanced speech-driven facial synthesis framework: RAE-NeRF (Residual-based Audio-video Encoder with Neural Radiance Fields). The framework integrates three core modules: (1) the ZipEnhancer speech enhancement module, which extracts high-quality features from noisy speech; (2) a residual-based audio-visual encoder, which effectively fuses audio and visual features to drive facial expressions accurately; and (3) a tri-plane hash encoder, which achieves high-quality 3D facial modeling and rendering while maintaining efficiency. Extensive experiments on multiple datasets demonstrate that RAE-NeRF significantly outperforms existing mainstream approaches in realism, lip-sync accuracy, and noise robustness, validating the framework's effectiveness for speech-driven facial synthesis in complex environments.
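To give a rough sense of what a residual audio-visual fusion block might look like, the NumPy sketch below projects both modalities into a shared space, fuses them, and adds the visual projection back through a residual connection. All dimensions, weight shapes, and the function name `residual_av_fuse` are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not taken from the paper)
d_audio, d_vis, d = 32, 64, 48

# Randomly initialized projection weights stand in for learned parameters
W_a = rng.normal(0, 0.1, (d_audio, d)); b_a = np.zeros(d)
W_v = rng.normal(0, 0.1, (d_vis, d));   b_v = np.zeros(d)
W_f = rng.normal(0, 0.1, (d, d));       b_f = np.zeros(d)

def residual_av_fuse(audio_feat, vis_feat):
    """Fuse audio and visual features with a residual connection.

    Both inputs are projected into a shared d-dimensional space; the
    fused representation is added back onto the visual projection so
    the audio signal acts as a correction rather than a replacement.
    """
    a = audio_feat @ W_a + b_a          # audio -> shared space
    v = vis_feat @ W_v + b_v            # visual -> shared space
    fused = np.tanh((a + v) @ W_f + b_f)
    return v + fused                    # residual connection

audio = rng.normal(size=(1, d_audio))
vis = rng.normal(size=(1, d_vis))
out = residual_av_fuse(audio, vis)
print(out.shape)  # (1, 48)
```

The residual structure here mirrors the general intuition behind residual fusion: the network only needs to learn how the audio signal perturbs the visual representation, which tends to stabilize training when one modality is noisy.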