Research on multimodal conditional diffusion image translation based on dynamic gating and attention masking
Abstract
In virtual-to-real image translation for autonomous driving scenarios, conditional diffusion models with multimodal data fusion employ multi-head self-attention to model cross-modal global dependencies between semantic segmentation maps and depth maps for scene generation. However, this approach still has limitations: the translated results exhibit inconsistencies between semantic contours and spatial depth, failing to meet high-precision requirements; multimodal data quality is uneven, so fixed-weight fusion is easily degraded by low-quality modalities; and background noise and modality noise weaken the constraining effect of key features on the denoising process. To address these issues, this paper proposes an improved multimodal feature fusion framework. A dynamic gating module is incorporated into the multi-head self-attention mechanism, adaptively modulating cross-modal feature weights through spatial semantic importance assessment and channel-wise modality contribution quantification, thereby balancing the contributions of high- and low-quality modalities. An attention masking mechanism is introduced, using learnable masks to filter out interference from non-critical regions and thereby enhance the representation of core elements such as vehicles and traffic signs. The optimized multimodal features are injected into the denoising process of the diffusion model as dual conditions, guiding the noise prediction network to align semantic and depth constraints at the pixel level and achieving precise translation from virtual to real scenes. Experimental results show that the improved model significantly enhances translation accuracy on the Cityscapes dataset, achieving superior performance on metrics such as Fréchet Inception Distance (FID) and Learned Perceptual Image Patch Similarity (LPIPS), with values of 42.80 and 0.412, respectively.
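To make the two proposed mechanisms concrete, the following is a minimal PyTorch sketch of one possible realization, based only on the description above. All module names, tensor shapes, and hyperparameters (the feature dimension, the number of heads, the two-modality setup) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming semantic and depth features of equal shape
# (B, C, H, W). Module names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class DynamicGatedFusion(nn.Module):
    """Adaptively re-weights semantic and depth features.

    The spatial gate (one weight map per modality, softmax-normalized)
    stands in for the abstract's "spatial semantic importance assessment";
    the channel gate (squeeze-and-excite style pooling) stands in for its
    "channel-wise modality contribution quantification".
    """

    def __init__(self, dim: int):
        super().__init__()
        # Spatial gate: one importance map per modality.
        self.spatial_gate = nn.Conv2d(2 * dim, 2, kernel_size=1)
        # Channel gate: per-channel weights for both modalities.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * dim, 2 * dim, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, sem: torch.Tensor, dep: torch.Tensor) -> torch.Tensor:
        x = torch.cat([sem, dep], dim=1)                # (B, 2C, H, W)
        s = torch.softmax(self.spatial_gate(x), dim=1)  # (B, 2, H, W)
        c = self.channel_gate(x)                        # (B, 2C, 1, 1)
        c_sem, c_dep = c.chunk(2, dim=1)
        # Low-quality modalities receive small gates, limiting their influence.
        return s[:, 0:1] * c_sem * sem + s[:, 1:2] * c_dep * dep


class MaskedSelfAttention(nn.Module):
    """Multi-head self-attention followed by a learnable soft mask that
    suppresses non-critical tokens (one assumed reading of the abstract's
    'attention masking mechanism')."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_net = nn.Linear(dim, 1)  # per-token mask logit

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) flattened fused features
        gate = torch.sigmoid(self.mask_net(tokens))  # (B, N, 1)
        out, _ = self.attn(tokens, tokens, tokens)
        return gate * out  # downweight tokens judged non-critical


# Example usage (shapes are illustrative):
# fusion = DynamicGatedFusion(dim=256)
# fused = fusion(sem_feat, dep_feat)           # (B, 256, H, W)
# tokens = fused.flatten(2).transpose(1, 2)    # (B, H*W, 256)
# cond = MaskedSelfAttention(256)(tokens)      # dual condition for the denoiser
```

In such a design, the fused and masked features would serve as the dual semantic-depth condition fed to the noise prediction network at each denoising step, though the exact injection scheme is not specified in the abstract.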