MvDeDiffusion: Multi-view Consistent Generation via Cross-view Deformable Attention for Denoising Diffusion Models


Abstract

Denoising diffusion models have demonstrated remarkable success in image generation, with numerous approaches achieving state-of-the-art synthesis quality. For autonomous driving applications, there is a critical need to extend these capabilities to multi-view image generation. However, achieving precise multi-view-consistent generation with 3D geometric awareness, which is critical for 3D perception tasks, remains challenging. Current approaches predominantly rely on overhead layout guidance, yet they frequently fail to maintain cross-view geometric coherence. This limitation manifests as misaligned object structures, discontinuous occlusions, and inconsistent depth relationships when synthesizing scenes from multiple angles. In this paper, we propose MvDeDiffusion, a diffusion-based framework for 3D-consistent multi-view image synthesis that introduces two key innovations: (1) a cross-view deformable attention mechanism that explicitly enforces geometric and appearance consistency between adjacent viewpoints by adaptively aligning features during the denoising process, and (2) a 3D-aware conditioning pipeline that integrates camera poses, foreground positional information, and adjacent-view overlap to enable fine-grained control over scene structure while preserving photorealistic details. Our framework ensures view-consistent generation by explicitly modeling inter-perspective correlations during the diffusion process, overcoming the inherent limitations of independent per-view synthesis. Comprehensive experiments demonstrate that our model achieves (1) superior multi-view continuity through geometrically coherent image synthesis and (2) strong controllability while preserving the richness of generated scenes. Quantitative evaluation confirms that these advances significantly outperform existing approaches in both cross-view alignment fidelity and scene variation richness.
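To make the first innovation concrete, the following is a minimal PyTorch sketch of what a cross-view deformable attention layer could look like. It is an illustrative assumption, not the authors' implementation; all names (CrossViewDeformableAttention, offset_head, ref_points) are hypothetical. Each query location in the current view predicts a small set of sampling offsets around a geometry-derived reference point in the adjacent view, gathers features there by bilinear interpolation, and blends them with learned weights, in the spirit of deformable attention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewDeformableAttention(nn.Module):
    """Sketch: deformable cross-attention from one camera view into an
    adjacent one. Each query predicts sampling offsets around a reference
    point in the neighbouring view's feature map, samples features there
    bilinearly, and combines them with learned per-point weights."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Per-query (x, y) offsets and a scalar weight for each sample point.
        self.offset_head = nn.Linear(dim, num_points * 2)
        self.weight_head = nn.Linear(dim, num_points)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query_feat: torch.Tensor, neighbor_feat: torch.Tensor,
                ref_points: torch.Tensor) -> torch.Tensor:
        """
        query_feat:    (B, N, C)    flattened features of the current view
        neighbor_feat: (B, C, H, W) feature map of the adjacent view
        ref_points:    (B, N, 2)    reference locations in the neighbour,
                       normalised to [0, 1] (e.g. from the overlap geometry)
        """
        B, N, C = query_feat.shape
        P = self.num_points

        # Predict offsets around each reference point and normalised weights.
        offsets = self.offset_head(query_feat).view(B, N, P, 2)
        weights = self.weight_head(query_feat).softmax(-1)               # (B, N, P)

        # Sampling locations mapped from [0, 1] to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1    # (B, N, P, 2)

        value = self.value_proj(neighbor_feat.flatten(2).transpose(1, 2))
        value = value.transpose(1, 2).view(B, C, *neighbor_feat.shape[-2:])

        # Bilinearly sample P points per query from the adjacent view.
        sampled = F.grid_sample(value, loc, align_corners=False)         # (B, C, N, P)
        out = (sampled * weights.unsqueeze(1)).sum(-1)                   # (B, C, N)
        return self.out_proj(out.transpose(1, 2))                        # (B, N, C)

In the paper's setting, the reference points would presumably be derived from the camera poses and adjacent-view overlap supplied by the 3D-aware conditioning pipeline, so that attention is concentrated on the image regions the two views actually share.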
