Co-Training Multimodal World Models and Diffusion-Guided Policies for Zero-Shot Contact-Rich Manipulation
Abstract
This study presents MWM‑DD (Multimodal World Model–Diffusion Decision), a unified control architecture that couples (i) a vision–tactile–proprioceptive encoder aligned in a shared predictive latent space; (ii) a Transformer‑VAE world model that captures latent dynamics and rewards; and (iii) a constraint‑ and value‑guided diffusion decision module realized as a receding‑horizon controller. The system is trained end‑to‑end with reconstruction and Kullback–Leibler (KL) annealing losses, a cross‑modal InfoNCE objective, and a diffusion ELBO, supplemented by kernel‑based off‑policy improvement and domain randomization. Across CALVIN (language‑conditioned rearrangements), FurnitureBench (multi‑stage assemblies), and the Functional Manipulation Benchmark (FMB; single‑ and multi‑object assemblies), the approach achieves state‑of‑the‑art zero‑shot performance and transfers to a Franka Panda + DIGIT platform with a small sim‑to‑real gap. Representative outcomes include 76.7% average success in simulation and 69.4% on hardware (20 trials per task), outperforming a strong vision‑only diffusion baseline (53.1% / 45.3%) while maintaining a 7.3‑point sim‑to‑real gap. Ablations attribute substantial contributions to the diffusion decision module (+22–27 points), tactile sensing (+15–18), cross‑modal alignment (+10–13), and zero‑shot regularization (+5–10). These findings support the thesis that co‑training predictive latent dynamics with guided diffusion enables robust zero‑shot generalization in contact‑rich manipulation without test‑time fine‑tuning (Chi et al., 2024; Mees et al., 2022; Heo et al., 2025; Luo et al., 2024).
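To make the cross‑modal alignment term concrete, the following is a minimal sketch of a symmetric InfoNCE loss between vision and tactile embeddings, in the style described in the abstract. It is written in PyTorch as an illustration only: the paper's actual implementation, projection dimensions, and temperature are not given here, so all names and defaults below are assumptions. Embeddings from the same timestep are treated as positive pairs; all other in‑batch pairings serve as negatives.

```python
# Illustrative sketch, not the authors' released code. Assumes vision and
# tactile encoders each emit a (B, D) batch of embeddings for the same
# B timesteps; temperature 0.07 is a conventional default, not from the paper.
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_vision: torch.Tensor,
                        z_tactile: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two (B, D) embedding batches."""
    z_v = F.normalize(z_vision, dim=-1)
    z_t = F.normalize(z_tactile, dim=-1)
    logits = z_v @ z_t.T / temperature          # (B, B) pairwise similarities
    labels = torch.arange(z_v.size(0), device=z_v.device)
    # Diagonal entries (same timestep) are the positives; cross-entropy is
    # applied row-wise (vision -> tactile) and column-wise (tactile -> vision).
    loss_v2t = F.cross_entropy(logits, labels)
    loss_t2v = F.cross_entropy(logits.T, labels)
    return 0.5 * (loss_v2t + loss_t2v)
```

In the full training objective this term would be summed with the reconstruction, KL‑annealed, and diffusion ELBO losses; the weighting among them is a design choice the abstract does not specify.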