Towards Human-Centered and Efficient Video Synthesis: A Survey of Multimodal Diffusion Models
Abstract
Multimodal video diffusion models have emerged as transformative tools for controlled video synthesis, integrating text, images, audio, and pose sequences to generate semantically meaningful content. Despite significant advances, critical gaps persist in temporal consistency, multimodal alignment, and human-centric motion generation. Existing surveys have not clearly addressed the complex interplay between these components, particularly physiological constraints and identity preservation in human motion synthesis. This survey provides a comprehensive analysis through a unified architectural framework, examining spatial-temporal representations and multimodal conditioning mechanisms. We present the first systematic evaluation of human-centric motion modeling, addressing the challenges of physiological plausibility and identity consistency. Our analysis reveals fundamental trade-offs between computational efficiency and generation quality, demonstrating that specialized techniques such as temporal block pruning achieve 523× computational savings with minimal quality degradation. Key findings indicate that current approaches struggle with seamless multimodal integration, that human-centric applications face "uncanny valley" effects when physics constraints are too rigid, and that identity preservation conflicts with motion dynamics. We introduce MIME-Vid (Multi-modal Integration with Motion Enhancement for Video Generation), a novel framework that integrates advanced Kalman filtering techniques with a multimodal architecture for enhanced temporal consistency and motion realism. Finally, we propose novel evaluation paradigms and identify future research directions for advancing multimodal video generation.
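To make the Kalman-filtering idea concrete, the following minimal Python sketch (not the authors' MIME-Vid implementation; the function name, noise parameters, and constant-velocity model are illustrative assumptions) shows how a Kalman filter can temporally smooth a noisy per-frame 2D pose keypoint before it is used as a conditioning signal, which is the kind of temporal-consistency mechanism the abstract refers to.

# Illustrative sketch only: a constant-velocity Kalman filter smoothing one
# 2D pose keypoint across frames; not the survey's actual MIME-Vid code.
import numpy as np

def smooth_keypoint(observations, dt=1.0, process_var=1e-3, obs_var=1e-2):
    """observations: (T, 2) array of noisy per-frame (x, y) keypoint positions."""
    # State: [x, y, vx, vy]; constant-velocity transition model.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # observe positions only
    Q = process_var * np.eye(4)                 # process noise covariance
    R = obs_var * np.eye(2)                     # observation noise covariance
    x = np.array([*observations[0], 0.0, 0.0])  # initial state estimate
    P = np.eye(4)                               # initial state covariance
    smoothed = []
    for z in observations:
        # Predict step
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step with the current frame's noisy observation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x[:2].copy())
    return np.stack(smoothed)

In practice, such filtering would be applied per keypoint to a detected pose sequence before it conditions the diffusion model, trading a small amount of responsiveness for reduced frame-to-frame jitter.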