MSCDDF: Multi-Stage Caption-Driven Diffusion Framework for Remote Sensing Image Semantic Segmentation
Abstract
Stable Diffusion (SD) excels at natural image generation, but applying its diffusion process to high-altitude remote sensing (RS) semantic segmentation faces three challenges: perspective and semantic bias, attention limitations, and semantic caption gaps. To address these challenges, this paper proposes a Multi-Stage Caption-Driven Diffusion Framework (MSCDDF) with four synergistic modules. First, a remote sensing semantic-aligned caption generation (RSCG) approach is proposed to bridge the text-visual gap via two-stage semantic refinement. Second, an adaptive fine-tuning via instance cropping (AFT) strategy is designed to reduce domain bias by conditioning SD on single-class instances paired with RSCG-generated professional captions. Third, a dual-stream architecture, multi-layer weighted attention joint generation (MWAJG), is developed to decouple image and mask generation, enhancing attention accuracy through cross-/self-attention fusion. Fourth, a ground sampling distance-based synthetic data generation (GSD-SG) approach is proposed to improve dataset diversity through semantically constrained multi-scale object injection. Extensive experiments and ablation studies on several publicly available RS datasets demonstrate the effectiveness of the proposed MSCDDF. Our source code is publicly available at https://github.com/WangXin81/MSCDDF.
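The abstract describes MWAJG only at a high level. As one concrete illustration, the sketch below shows what layer-weighted cross-/self-attention fusion for mask generation could look like in PyTorch. This is not the authors' implementation (see the linked repository for that); the module name, tensor shapes, and the learned per-layer weighting are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse per-layer cross-attention maps
    (class tokens -> pixels) with self-attention maps (pixel -> pixel)
    via learned layer weights, yielding a soft class map usable as a
    segmentation-mask prior."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable logit per U-Net layer; softmax gives layer weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, cross_attn: torch.Tensor, self_attn: torch.Tensor) -> torch.Tensor:
        # cross_attn: (L, B, HW, C) - attention from C class tokens to HW pixels
        # self_attn:  (L, B, HW, HW) - row-normalized pixel-to-pixel affinities
        w = torch.softmax(self.layer_logits, dim=0)           # (L,)
        # Propagate class evidence along pixel affinities, per layer.
        refined = torch.einsum("lbqk,lbkc->lbqc", self_attn, cross_attn)
        # Weighted sum over layers, then normalize over classes.
        fused = torch.einsum("l,lbqc->bqc", w, refined)       # (B, HW, C)
        return F.softmax(fused, dim=-1)

# Toy usage on a 16x16 latent grid with 6 classes and 4 U-Net layers.
fusion = WeightedAttentionFusion(num_layers=4)
ca = torch.rand(4, 2, 16 * 16, 6)
sa = torch.softmax(torch.rand(4, 2, 16 * 16, 16 * 16), dim=-1)
mask_prior = fusion(ca, sa)  # (2, 256, 6)
```

The design choice sketched here, propagating text-to-image attention along pixel affinities before a weighted reduction over layers, is a common way to sharpen diffusion attention maps for dense prediction; the paper's actual fusion rule may differ.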