Improved Hallo: Identity-aware and High-fidelity Audio-Driven Portrait Image Animation
Abstract
Speech-driven portrait animation models have made significant progress in generating realistic and dynamic portrait animations. The class of end-to-end latent diffusion paradigms represented by Hallo [17] achieves impressive alignment accuracy between audio inputs and visual outputs, encompassing lip movements, expressions, and head poses. However, constrained by the suboptimal interaction design between the reference portrait information and the denoising U-Net in such architectures, certain frames in the output video sequences suffer from inconsistencies in identity and background preservation. Moreover, the temporal attention within the temporal module incorporates information across all frames within each generation unit to capture overall motion trends, but ignores shorter frame subsequences within the unit, consequently losing fine-grained details between adjacent frames. To address these problems, we take the pre-trained Hallo model [17] as the backbone network and construct a Multi-Source Self-Attention (MSSA) module to optimize the interaction between the reference portrait identity information and the denoising U-Net. In addition, we propose a plug-and-play, training-free method, Unit-wise Spectral-Blend Temporal Attention (U-SBTA), which simultaneously captures local high-frequency facial details from shorter frame subsequences within each generation unit, thereby improving facial fidelity in synthesized portrait videos. Our method is comprehensively evaluated on a public dataset and our collected datasets through both qualitative and quantitative analyses. The results demonstrate that the portrait animation videos generated by our method better preserve identity and background consistency with the reference portrait, while exhibiting superior facial detail fidelity.
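The spectral-blend idea described above can be illustrated with a minimal, hypothetical sketch: run temporal self-attention once over the whole generation unit (capturing overall motion trends) and once over short frame subsequences (capturing local detail), then blend the two outputs along the time axis in the frequency domain, taking low frequencies from the whole-unit pass and high frequencies from the subsequence pass. All function names, the window size, and the frequency cutoff here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x):
    # x: (T, D) per-frame features; single-head self-attention along time.
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d), axis=-1)
    return attn @ x

def unitwise_spectral_blend(x, window=4, cutoff=2):
    # Hypothetical U-SBTA-style blend (window/cutoff values are assumed):
    # whole-unit attention supplies low temporal frequencies (motion trends),
    # window-wise attention supplies high frequencies (fine local detail).
    T = x.shape[0]
    g = temporal_self_attention(x)  # one pass over the full generation unit
    l = np.concatenate([temporal_self_attention(x[i:i + window])
                        for i in range(0, T, window)])  # short subsequences
    G = np.fft.rfft(g, axis=0)
    L = np.fft.rfft(l, axis=0)
    low = (np.arange(G.shape[0]) < cutoff)[:, None]  # low-frequency selector
    return np.fft.irfft(np.where(low, G, L), n=T, axis=0)

rng = np.random.default_rng(0)
unit = rng.standard_normal((16, 8))   # 16 frames, 8-dim features
blended = unitwise_spectral_blend(unit)
print(blended.shape)  # (16, 8): same shape as the input unit
```

Because the blend only post-processes attention outputs, it requires no extra parameters or fine-tuning, which is consistent with the training-free, plug-and-play property claimed for U-SBTA.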