A Unified Framework for Human Motion Generation with Multimodal Inputs
Abstract
To enable generalized human motion generation, this paper proposes a unified generation framework, UniMotion, that supports multimodal inputs including text, images, and audio. The method uses a unified prompt encoder to map the different input modalities into a shared cross-modal semantic space, and adopts a two-stage motion decoder that progressively generates fine-grained skeleton sequences. A multimodal alignment loss is introduced to strengthen consistency modeling across different prompts. In semantic generalization and prompt-consistency evaluations, UniMotion outperforms baseline methods by 7.3% and 8.9%, respectively. Under random multimodal prompt switching, it maintains 92.4% motion stability and logical consistency, demonstrating strong practicality and scalability. This study broadens the applicability of multimodal generative models to human motion modeling.
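As a rough illustration of how a unified prompt encoder and multimodal alignment loss might be structured (the abstract does not specify implementation details), the following PyTorch sketch projects per-modality features into a shared space and aligns paired prompts with a symmetric InfoNCE-style contrastive loss. All class names, dimensions, and the specific loss form are assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of a unified prompt encoder with a multimodal
# alignment loss. Names, dimensions, and the InfoNCE loss choice are
# assumptions; the paper does not publish this code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedPromptEncoder(nn.Module):
    """Maps text, image, and audio features into one shared semantic space."""
    def __init__(self, text_dim=512, image_dim=768, audio_dim=128, shared_dim=256):
        super().__init__()
        # One projection head per modality into the shared space.
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)

    def forward(self, feats: torch.Tensor, modality: str) -> torch.Tensor:
        proj = {"text": self.text_proj,
                "image": self.image_proj,
                "audio": self.audio_proj}[modality]
        # L2-normalize so every modality lies on the same unit hypersphere.
        return F.normalize(proj(feats), dim=-1)

def multimodal_alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss pulling paired prompts from two modalities together."""
    logits = z_a @ z_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy in both directions: a->b and b->a.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: align a batch of paired text/audio prompt embeddings.
encoder = UnifiedPromptEncoder()
text_feats = torch.randn(8, 512)
audio_feats = torch.randn(8, 128)
loss = multimodal_alignment_loss(encoder(text_feats, "text"),
                                 encoder(audio_feats, "audio"))
```

One plausible reading of the reported prompt-switching stability is that such an alignment loss keeps embeddings of semantically equivalent prompts close across modalities, so the downstream two-stage decoder sees a consistent conditioning signal when the input modality changes mid-sequence.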