A Unified Framework for Human Motion Generation with Multimodal Inputs
Abstract
To enable generalized human motion generation, this paper proposes a unified generation framework, UniMotion, that supports multimodal inputs including text, images, and audio. The method uses a unified prompt encoder to map the different input modalities into a shared cross-modal semantic space, and adopts a two-stage motion decoder that progressively generates fine-grained skeleton sequences. A multimodal alignment loss is introduced to strengthen consistency modeling across different prompts. In semantic generalization and prompt-consistency evaluations, UniMotion outperforms baseline methods by 7.3% and 8.9%, respectively. Under random multimodal prompt switching, it maintains 92.4% motion stability and logical consistency, demonstrating strong practicality and scalability. This study broadens the applicability of multimodal generative models to human motion modeling.
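As a rough illustration of how a unified prompt encoder and multimodal alignment loss might be structured (the abstract does not specify implementation details), the following PyTorch sketch projects per-modality features into a shared space and aligns paired prompts with a symmetric InfoNCE-style contrastive loss. All class names, dimensions, and the specific loss form are assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of a unified prompt encoder with a multimodal
# alignment loss. Names, dimensions, and the InfoNCE loss choice are
# assumptions; the paper does not publish this code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedPromptEncoder(nn.Module):
    """Maps text, image, and audio features into one shared semantic space."""
    def __init__(self, text_dim=512, image_dim=768, audio_dim=128, shared_dim=256):
        super().__init__()
        # One projection head per modality into the shared space.
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)

    def forward(self, feats: torch.Tensor, modality: str) -> torch.Tensor:
        proj = {"text": self.text_proj,
                "image": self.image_proj,
                "audio": self.audio_proj}[modality]
        # L2-normalize so every modality lies on the same unit hypersphere.
        return F.normalize(proj(feats), dim=-1)

def multimodal_alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss pulling paired prompts from two modalities together."""
    logits = z_a @ z_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy in both directions: a->b and b->a.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: align a batch of paired text/audio prompt embeddings.
encoder = UnifiedPromptEncoder()
text_feats = torch.randn(8, 512)
audio_feats = torch.randn(8, 128)
loss = multimodal_alignment_loss(encoder(text_feats, "text"),
                                 encoder(audio_feats, "audio"))
```

One plausible reading of the reported prompt-switching stability is that such an alignment loss keeps embeddings of semantically equivalent prompts close across modalities, so the downstream two-stage decoder sees a consistent conditioning signal when the input modality changes mid-sequence.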