Towards Human-Centered and Efficient Video Synthesis: A Survey of Multimodal Diffusion Models
Abstract
Multimodal video diffusion models have emerged as transformative tools for controlled video synthesis, integrating text, images, audio, and pose sequences to generate semantically meaningful content. Despite significant advances, critical gaps persist in temporal consistency, multimodal alignment, and human-centric motion generation. Existing surveys have not clearly addressed the complex interplay between these components, particularly physiological constraints and identity preservation in human motion synthesis. This survey provides a comprehensive analysis through a unified architectural framework, examining spatial-temporal representations and multimodal conditioning mechanisms. We present the first systematic evaluation of human-centric motion modeling, addressing the challenges of physiological plausibility and identity consistency. Our analysis reveals fundamental trade-offs between computational efficiency and generation quality, demonstrating that specialized techniques such as temporal block pruning achieve 523× computational savings with minimal quality degradation. Key findings indicate that current approaches struggle with seamless multimodal integration, that human-centric applications face "uncanny valley" effects when physics constraints are too rigid, and that identity preservation conflicts with motion dynamics. We introduce MIME-Vid (Multi-modal Integration with Motion Enhancement for Video Generation), a novel framework that integrates advanced Kalman filtering techniques with a multimodal architecture for enhanced temporal consistency and motion realism. Finally, we propose novel evaluation paradigms and identify future research directions for advancing multimodal video generation.
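To make the Kalman-filtering idea concrete, the following minimal Python sketch (not the authors' MIME-Vid implementation; the function name, noise parameters, and constant-velocity model are illustrative assumptions) shows how a Kalman filter can temporally smooth a noisy per-frame 2D pose keypoint before it is used as a conditioning signal, which is the kind of temporal-consistency mechanism the abstract refers to.

# Illustrative sketch only: a constant-velocity Kalman filter smoothing one
# 2D pose keypoint across frames; not the survey's actual MIME-Vid code.
import numpy as np

def smooth_keypoint(observations, dt=1.0, process_var=1e-3, obs_var=1e-2):
    """observations: (T, 2) array of noisy per-frame (x, y) keypoint positions."""
    # State: [x, y, vx, vy]; constant-velocity transition model.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # observe positions only
    Q = process_var * np.eye(4)                 # process noise covariance
    R = obs_var * np.eye(2)                     # observation noise covariance
    x = np.array([*observations[0], 0.0, 0.0])  # initial state estimate
    P = np.eye(4)                               # initial state covariance
    smoothed = []
    for z in observations:
        # Predict step
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step with the current frame's noisy observation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x[:2].copy())
    return np.stack(smoothed)

In practice, such filtering would be applied per keypoint to a detected pose sequence before it conditions the diffusion model, trading a small amount of responsiveness for reduced frame-to-frame jitter.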