Co-Training Multimodal World Models and Diffusion-Guided Policies for Zero-Shot Contact-Rich Manipulation
Abstract
This study presents MWM‑DD (Multimodal World Model–Diffusion Decision), a unified control architecture that couples (i) a vision–tactile–proprioceptive encoder aligned in a shared predictive latent space; (ii) a Transformer‑VAE world model that captures latent dynamics and rewards; and (iii) a constraint‑ and value‑guided diffusion decision module realized as a receding‑horizon controller. The system is trained end‑to‑end with reconstruction and Kullback–Leibler (KL) annealing losses, a cross‑modal InfoNCE objective, and a diffusion ELBO, supplemented by kernel‑based off‑policy improvement and domain randomization. Across CALVIN (language‑conditioned rearrangements), FurnitureBench (multi‑stage assemblies), and the Functional Manipulation Benchmark (FMB; single‑ and multi‑object assemblies), the approach achieves state‑of‑the‑art zero‑shot performance and transfers to a Franka Panda + DIGIT platform with a small sim‑to‑real gap. Representative outcomes include 76.7% average success in simulation and 69.4% on hardware (20 trials per task), outperforming a strong vision‑only diffusion baseline (53.1% / 45.3%) while maintaining a 7.3‑point sim‑to‑real gap. Ablations attribute substantial contributions to the diffusion decision module (+22–27 points), tactile sensing (+15–18), cross‑modal alignment (+10–13), and zero‑shot regularization (+5–10). These findings support the thesis that co‑training predictive latent dynamics with guided diffusion enables robust zero‑shot generalization in contact‑rich manipulation without test‑time fine‑tuning (Chi et al., 2024; Mees et al., 2022; Heo et al., 2025; Luo et al., 2024).
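To make the cross‑modal alignment term concrete, the following is a minimal sketch of a symmetric InfoNCE loss between vision and tactile embeddings, in the style described in the abstract. It is written in PyTorch as an illustration only: the paper's actual implementation, projection dimensions, and temperature are not given here, so all names and defaults below are assumptions. Embeddings from the same timestep are treated as positive pairs; all other in‑batch pairings serve as negatives.

```python
# Illustrative sketch, not the authors' released code. Assumes vision and
# tactile encoders each emit a (B, D) batch of embeddings for the same
# B timesteps; temperature 0.07 is a conventional default, not from the paper.
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_vision: torch.Tensor,
                        z_tactile: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two (B, D) embedding batches."""
    z_v = F.normalize(z_vision, dim=-1)
    z_t = F.normalize(z_tactile, dim=-1)
    logits = z_v @ z_t.T / temperature          # (B, B) pairwise similarities
    labels = torch.arange(z_v.size(0), device=z_v.device)
    # Diagonal entries (same timestep) are the positives; cross-entropy is
    # applied row-wise (vision -> tactile) and column-wise (tactile -> vision).
    loss_v2t = F.cross_entropy(logits, labels)
    loss_t2v = F.cross_entropy(logits.T, labels)
    return 0.5 * (loss_v2t + loss_t2v)
```

In the full training objective this term would be summed with the reconstruction, KL‑annealed, and diffusion ELBO losses; the weighting among them is a design choice the abstract does not specify.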