Improved Hallo: Identity-aware and High-fidelity Audio-Driven Portrait Image Animation
Abstract
Speech-driven portrait animation models have made significant progress in generating realistic and dynamic portrait animations. The class of end-to-end latent diffusion paradigms represented by Hallo [17] achieves impressive alignment accuracy between audio inputs and visual outputs, encompassing lip movements, expressions, and head poses. However, constrained by the suboptimal interaction design between the reference portrait information and the denoising U-Net in such architectures, certain frames in the output video sequences suffer from inconsistencies in identity and background preservation. Moreover, the temporal attention within the temporal module incorporates information across all frames within each generation unit to capture overall motion trends, but ignores shorter frame subsequences within the unit, consequently losing fine-grained details between adjacent frames. To address these problems, we take the pre-trained Hallo model [17] as the backbone network and construct a Multi-Source Self-Attention (MSSA) module to optimize the interaction between the reference portrait identity information and the denoising U-Net. In addition, we propose a plug-and-play, training-free method, Unit-wise Spectral-Blend Temporal Attention (U-SBTA), which simultaneously captures local high-frequency facial details from shorter frame subsequences within each generation unit, thereby improving facial fidelity in synthesized portrait videos. Our method is comprehensively evaluated on a public dataset and our collected datasets through both qualitative and quantitative analyses. The results demonstrate that the portrait animation videos generated by our method better preserve identity and background consistency with the reference portrait, while exhibiting superior facial detail fidelity.
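The spectral-blend idea described above can be illustrated with a minimal, hypothetical sketch: run temporal self-attention once over the whole generation unit (capturing overall motion trends) and once over short frame subsequences (capturing local detail), then blend the two outputs along the time axis in the frequency domain, taking low frequencies from the whole-unit pass and high frequencies from the subsequence pass. All function names, the window size, and the frequency cutoff here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x):
    # x: (T, D) per-frame features; single-head self-attention along time.
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d), axis=-1)
    return attn @ x

def unitwise_spectral_blend(x, window=4, cutoff=2):
    # Hypothetical U-SBTA-style blend (window/cutoff values are assumed):
    # whole-unit attention supplies low temporal frequencies (motion trends),
    # window-wise attention supplies high frequencies (fine local detail).
    T = x.shape[0]
    g = temporal_self_attention(x)  # one pass over the full generation unit
    l = np.concatenate([temporal_self_attention(x[i:i + window])
                        for i in range(0, T, window)])  # short subsequences
    G = np.fft.rfft(g, axis=0)
    L = np.fft.rfft(l, axis=0)
    low = (np.arange(G.shape[0]) < cutoff)[:, None]  # low-frequency selector
    return np.fft.irfft(np.where(low, G, L), n=T, axis=0)

rng = np.random.default_rng(0)
unit = rng.standard_normal((16, 8))   # 16 frames, 8-dim features
blended = unitwise_spectral_blend(unit)
print(blended.shape)  # (16, 8): same shape as the input unit
```

Because the blend only post-processes attention outputs, it requires no extra parameters or fine-tuning, which is consistent with the training-free, plug-and-play property claimed for U-SBTA.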