MiMo-Embodied: X-Embodied Foundation Model

Abstract

MiMo-Embodied is a cross-embodied foundation model that integrates indoor robotics and outdoor autonomous driving within a single model, bridging the "domain gap" that has historically siloed Embodied AI and driving systems. The model achieves state-of-the-art performance across 29 benchmarks: 17 tasks in Embodied AI (including task planning, affordance prediction, and spatial understanding) and 12 in autonomous driving (spanning environmental perception, status prediction, and driving planning). Central to these gains is a multi-stage training strategy that couples curated general and domain-specific data with supervised alignment, chain-of-thought reasoning supervision, and reinforcement learning, yielding robust positive transfer and mutual reinforcement between embodiments. The model consistently outperforms specialized, open-source, and closed-source counterparts while retaining general visual semantic understanding. These results indicate that a single vision-language foundation model can acquire cohesive physical intelligence across diverse embodiments and environments, and suggest that carefully staged multimodal training can unlock cross-task generalization without architectural specialization. This work offers a scalable framework for developing unified embodied systems. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
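
The staged recipe the abstract describes (supervised alignment, then chain-of-thought supervision, then reinforcement learning) can be illustrated with a toy sketch. The PyTorch code below is a hypothetical illustration, not the authors' implementation: the tiny model, the batch generator, the REINFORCE-style RL stage, and the function names supervised_stage, rl_stage, and make_batches are all assumptions made for demonstration.

```python
# Hypothetical sketch of a three-stage training recipe:
# supervised alignment -> chain-of-thought supervision -> RL.
# Everything here (model, data, rewards) is a toy stand-in.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a vision-language model: maps a fused
# image+text feature vector to next-token logits.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def supervised_stage(batches, name):
    """Cross-entropy supervision; reused for both the alignment stage
    (short answer targets) and the CoT stage (reasoning-trace targets)."""
    loss_fn = nn.CrossEntropyLoss()
    for feats, targets in batches:
        loss = loss_fn(model(feats), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{name}: final loss {loss.item():.3f}")

def rl_stage(batches, reward_fn):
    """REINFORCE-style stage: sample an answer, then weight its
    log-probability by a task reward (e.g., plan correctness)."""
    for feats, _ in batches:
        dist = torch.distributions.Categorical(logits=model(feats))
        actions = dist.sample()
        loss = -(reward_fn(actions) * dist.log_prob(actions)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def make_batches(n=8):
    """Dummy mixed-domain batches; indoor-robotics and driving samples
    share one model, per the cross-embodiment thesis."""
    return [(torch.randn(16, 32), torch.randint(0, 10, (16,)))
            for _ in range(n)]

supervised_stage(make_batches(), "stage 1: supervised alignment")
supervised_stage(make_batches(), "stage 2: CoT supervision")
rl_stage(make_batches(), reward_fn=lambda a: (a == 3).float())  # toy reward
```

A real pipeline would replace the toy model with the vision-language backbone and the dummy batches with the curated general, embodied, and driving corpora the abstract describes; the point of the sketch is only the staged ordering of objectives.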
