MvDeDiffusion: Multi-view Consistent Generation via Cross-view Deformable Attention for Denoising Diffusion Models


Abstract

Denoising diffusion models have demonstrated remarkable success in image generation, with numerous approaches achieving state-of-the-art synthesis quality. For autonomous driving applications, there is a critical need to extend these capabilities to multi-view image generation. However, achieving precise multi-view-consistent generation with 3D geometric awareness, which is critical for 3D perception tasks, remains challenging. Current approaches predominantly rely on overhead layout guidance, yet they frequently fail to maintain cross-view geometric coherence. This limitation manifests as misaligned object structures, discontinuous occlusions, and inconsistent depth relationships when synthesizing scenes from multiple angles. In this paper, we propose MvDeDiffusion, a diffusion-based framework for 3D-consistent multi-view image synthesis that introduces two key innovations: (1) a cross-view deformable attention mechanism that explicitly enforces geometric and appearance consistency between adjacent viewpoints by adaptively aligning features during the denoising process, and (2) a 3D-aware conditioning pipeline that integrates camera poses, foreground positional information, and adjacent-view overlap to enable fine-grained control over scene structure while preserving photorealistic details. Our framework ensures view-consistent generation by explicitly modeling inter-perspective correlations during the diffusion process, overcoming the inherent limitations of independent per-view synthesis. Comprehensive experiments demonstrate that our model achieves (1) superior multi-view continuity through geometrically coherent image synthesis and (2) strong controllability while preserving the richness of generated scenes. Quantitative evaluation confirms that these advances significantly outperform existing approaches in both cross-view alignment fidelity and scene variation richness.
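To make the first innovation concrete, the following is a minimal PyTorch sketch of what a cross-view deformable attention layer could look like. It is an illustrative assumption, not the authors' implementation; all names (CrossViewDeformableAttention, offset_head, ref_points) are hypothetical. Each query location in the current view predicts a small set of sampling offsets around a geometry-derived reference point in the adjacent view, gathers features there by bilinear interpolation, and blends them with learned weights, in the spirit of deformable attention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewDeformableAttention(nn.Module):
    """Sketch: deformable cross-attention from one camera view into an
    adjacent one. Each query predicts sampling offsets around a reference
    point in the neighbouring view's feature map, samples features there
    bilinearly, and combines them with learned per-point weights."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Per-query (x, y) offsets and a scalar weight for each sample point.
        self.offset_head = nn.Linear(dim, num_points * 2)
        self.weight_head = nn.Linear(dim, num_points)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query_feat: torch.Tensor, neighbor_feat: torch.Tensor,
                ref_points: torch.Tensor) -> torch.Tensor:
        """
        query_feat:    (B, N, C)    flattened features of the current view
        neighbor_feat: (B, C, H, W) feature map of the adjacent view
        ref_points:    (B, N, 2)    reference locations in the neighbour,
                       normalised to [0, 1] (e.g. from the overlap geometry)
        """
        B, N, C = query_feat.shape
        P = self.num_points

        # Predict offsets around each reference point and normalised weights.
        offsets = self.offset_head(query_feat).view(B, N, P, 2)
        weights = self.weight_head(query_feat).softmax(-1)               # (B, N, P)

        # Sampling locations mapped from [0, 1] to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1    # (B, N, P, 2)

        value = self.value_proj(neighbor_feat.flatten(2).transpose(1, 2))
        value = value.transpose(1, 2).view(B, C, *neighbor_feat.shape[-2:])

        # Bilinearly sample P points per query from the adjacent view.
        sampled = F.grid_sample(value, loc, align_corners=False)         # (B, C, N, P)
        out = (sampled * weights.unsqueeze(1)).sum(-1)                   # (B, C, N)
        return self.out_proj(out.transpose(1, 2))                        # (B, N, C)

In the paper's setting, the reference points would presumably be derived from the camera poses and adjacent-view overlap supplied by the 3D-aware conditioning pipeline, so that attention is concentrated on the image regions the two views actually share.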
