Research on multimodal conditional diffusion image translation based on dynamic gating and attention masking
Abstract
In virtual-to-real image translation for autonomous driving scenarios, conditional diffusion models with multimodal data fusion employ multi-head self-attention to model cross-modal global dependencies between semantic segmentation maps and depth maps for scene generation. However, this approach still has limitations: the translated results exhibit inconsistencies between semantic contours and spatial depth, failing to meet high-precision requirements; multimodal data quality is uneven, so fixed-weight fusion is easily degraded by low-quality modalities; and background noise and modality noise weaken the constraining effect of key features on the denoising process. To address these issues, this paper proposes an improved multimodal feature fusion framework. A dynamic gating module is incorporated into the multi-head self-attention mechanism, adaptively modulating cross-modal feature weights through spatial semantic importance assessment and channel-wise modality contribution quantification, thereby balancing the contributions of high- and low-quality modalities. An attention masking mechanism is introduced, using learnable masks to filter out interference from non-critical regions and thereby enhance the representation of core elements such as vehicles and traffic signs. The optimized multimodal features are injected into the denoising process of the diffusion model as dual conditions, guiding the noise prediction network to align semantic and depth constraints at the pixel level and achieving precise translation from virtual to real scenes. Experimental results show that the improved model significantly enhances translation accuracy on the Cityscapes dataset, achieving superior performance on metrics such as Fréchet Inception Distance (FID) and Learned Perceptual Image Patch Similarity (LPIPS), with values of 42.80 and 0.412, respectively.
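To make the two proposed mechanisms concrete, the following is a minimal PyTorch sketch of one possible realization, based only on the description above. All module names, tensor shapes, and hyperparameters (the feature dimension, the number of heads, the two-modality setup) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming semantic and depth features of equal shape
# (B, C, H, W). Module names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class DynamicGatedFusion(nn.Module):
    """Adaptively re-weights semantic and depth features.

    The spatial gate (one weight map per modality, softmax-normalized)
    stands in for the abstract's "spatial semantic importance assessment";
    the channel gate (squeeze-and-excite style pooling) stands in for its
    "channel-wise modality contribution quantification".
    """

    def __init__(self, dim: int):
        super().__init__()
        # Spatial gate: one importance map per modality.
        self.spatial_gate = nn.Conv2d(2 * dim, 2, kernel_size=1)
        # Channel gate: per-channel weights for both modalities.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * dim, 2 * dim, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, sem: torch.Tensor, dep: torch.Tensor) -> torch.Tensor:
        x = torch.cat([sem, dep], dim=1)                # (B, 2C, H, W)
        s = torch.softmax(self.spatial_gate(x), dim=1)  # (B, 2, H, W)
        c = self.channel_gate(x)                        # (B, 2C, 1, 1)
        c_sem, c_dep = c.chunk(2, dim=1)
        # Low-quality modalities receive small gates, limiting their influence.
        return s[:, 0:1] * c_sem * sem + s[:, 1:2] * c_dep * dep


class MaskedSelfAttention(nn.Module):
    """Multi-head self-attention followed by a learnable soft mask that
    suppresses non-critical tokens (one assumed reading of the abstract's
    'attention masking mechanism')."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_net = nn.Linear(dim, 1)  # per-token mask logit

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) flattened fused features
        gate = torch.sigmoid(self.mask_net(tokens))  # (B, N, 1)
        out, _ = self.attn(tokens, tokens, tokens)
        return gate * out  # downweight tokens judged non-critical


# Example usage (shapes are illustrative):
# fusion = DynamicGatedFusion(dim=256)
# fused = fusion(sem_feat, dep_feat)           # (B, 256, H, W)
# tokens = fused.flatten(2).transpose(1, 2)    # (B, H*W, 256)
# cond = MaskedSelfAttention(256)(tokens)      # dual condition for the denoiser
```

In such a design, the fused and masked features would serve as the dual semantic-depth condition fed to the noise prediction network at each denoising step, though the exact injection scheme is not specified in the abstract.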