RAE-NeRF: Residual-Based Audio-Video Encoder with Denoising in Talking Head Synchronization
Abstract
In recent years, speech-driven facial synthesis has attracted significant attention owing to its wide applications in virtual humans, remote conferencing, and digital human generation. However, existing methods remain limited in realism, synchronization, and robustness, primarily because of noise interference in speech signals and insufficient precision in audio-visual feature fusion. To address these challenges, this paper proposes an enhanced speech-driven facial synthesis framework: RAE-NeRF (Residual-based Audio-video Encoder with Neural Radiance Fields). The framework integrates three core modules: (1) the ZipEnhancer speech enhancement module, which extracts high-quality features from noisy speech; (2) a residual-based audio-visual encoder, which effectively fuses audio and visual features to drive facial expressions accurately; and (3) a tri-plane hash encoder, which achieves high-quality 3D facial modeling and rendering while maintaining efficiency. Extensive experiments on multiple datasets demonstrate that RAE-NeRF significantly outperforms existing mainstream approaches in realism, lip-sync accuracy, and noise robustness, validating the framework's effectiveness for speech-driven facial synthesis in complex environments.
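To give a rough sense of what a residual audio-visual fusion block might look like, the NumPy sketch below projects both modalities into a shared space, fuses them, and adds the visual projection back through a residual connection. All dimensions, weight shapes, and the function name `residual_av_fuse` are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not taken from the paper)
d_audio, d_vis, d = 32, 64, 48

# Randomly initialized projection weights stand in for learned parameters
W_a = rng.normal(0, 0.1, (d_audio, d)); b_a = np.zeros(d)
W_v = rng.normal(0, 0.1, (d_vis, d));   b_v = np.zeros(d)
W_f = rng.normal(0, 0.1, (d, d));       b_f = np.zeros(d)

def residual_av_fuse(audio_feat, vis_feat):
    """Fuse audio and visual features with a residual connection.

    Both inputs are projected into a shared d-dimensional space; the
    fused representation is added back onto the visual projection so
    the audio signal acts as a correction rather than a replacement.
    """
    a = audio_feat @ W_a + b_a          # audio -> shared space
    v = vis_feat @ W_v + b_v            # visual -> shared space
    fused = np.tanh((a + v) @ W_f + b_f)
    return v + fused                    # residual connection

audio = rng.normal(size=(1, d_audio))
vis = rng.normal(size=(1, d_vis))
out = residual_av_fuse(audio, vis)
print(out.shape)  # (1, 48)
```

The residual structure here mirrors the general intuition behind residual fusion: the network only needs to learn how the audio signal perturbs the visual representation, which tends to stabilize training when one modality is noisy.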