MiMo-Embodied: X-Embodied Foundation Model

Abstract

MiMo-Embodied is a cross-embodied foundation model that integrates indoor robotics and outdoor autonomous driving within a single model, bridging the "domain gap" that has historically siloed Embodied AI and driving systems. The model achieves state-of-the-art performance across 29 benchmarks: 17 tasks in Embodied AI (including task planning, affordance prediction, and spatial understanding) and 12 in autonomous driving (spanning environmental perception, status prediction, and driving planning). Central to these gains is a multi-stage training strategy that couples curated general and domain-specific data with supervised alignment, chain-of-thought reasoning supervision, and reinforcement learning, yielding robust positive transfer and mutual reinforcement between embodiments. The model consistently outperforms specialized, open-source, and closed-source counterparts while retaining general visual semantic understanding. These results indicate that a single vision-language foundation model can acquire cohesive physical intelligence across diverse embodiments and environments, and suggest that carefully staged multimodal training can unlock cross-task generalization without architectural specialization. This work offers a scalable framework for developing unified embodied systems. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
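
The staged recipe the abstract describes (supervised alignment, then chain-of-thought supervision, then reinforcement learning) can be illustrated with a toy sketch. The PyTorch code below is a hypothetical illustration, not the authors' implementation: the tiny model, the batch generator, the REINFORCE-style RL stage, and the function names supervised_stage, rl_stage, and make_batches are all assumptions made for demonstration.

```python
# Hypothetical sketch of a three-stage training recipe:
# supervised alignment -> chain-of-thought supervision -> RL.
# Everything here (model, data, rewards) is a toy stand-in.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a vision-language model: maps a fused
# image+text feature vector to next-token logits.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def supervised_stage(batches, name):
    """Cross-entropy supervision; reused for both the alignment stage
    (short answer targets) and the CoT stage (reasoning-trace targets)."""
    loss_fn = nn.CrossEntropyLoss()
    for feats, targets in batches:
        loss = loss_fn(model(feats), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{name}: final loss {loss.item():.3f}")

def rl_stage(batches, reward_fn):
    """REINFORCE-style stage: sample an answer, then weight its
    log-probability by a task reward (e.g., plan correctness)."""
    for feats, _ in batches:
        dist = torch.distributions.Categorical(logits=model(feats))
        actions = dist.sample()
        loss = -(reward_fn(actions) * dist.log_prob(actions)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def make_batches(n=8):
    """Dummy mixed-domain batches; indoor-robotics and driving samples
    share one model, per the cross-embodiment thesis."""
    return [(torch.randn(16, 32), torch.randint(0, 10, (16,)))
            for _ in range(n)]

supervised_stage(make_batches(), "stage 1: supervised alignment")
supervised_stage(make_batches(), "stage 2: CoT supervision")
rl_stage(make_batches(), reward_fn=lambda a: (a == 3).float())  # toy reward
```

A real pipeline would replace the toy model with the vision-language backbone and the dummy batches with the curated general, embodied, and driving corpora the abstract describes; the point of the sketch is only the staged ordering of objectives.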
