MSCDDF: Multi-Stage Caption-Driven Diffusion Framework for Remote Sensing Image Semantic Segmentation
Abstract
Stable Diffusion (SD) excels at natural image generation, but applying its diffusion process to high-altitude remote sensing (RS) semantic segmentation faces three challenges: perspective and semantic bias, attention limitations, and semantic caption gaps. To address these challenges, this paper proposes a Multi-Stage Caption-Driven Diffusion Framework (MSCDDF) with four synergistic modules. First, a remote sensing semantic-aligned caption generation (RSCG) approach is proposed to bridge the text-visual gap via two-stage semantic refinement. Second, an adaptive fine-tuning via instance cropping (AFT) strategy is designed to reduce domain bias by conditioning SD on single-class instances paired with RSCG-generated professional captions. Third, a dual-stream architecture, multi-layer weighted attention joint generation (MWAJG), is developed to decouple image and mask generation, enhancing attention accuracy through cross-/self-attention fusion. Fourth, a ground sampling distance-based synthetic data generation (GSD-SG) approach is proposed to improve dataset diversity through semantically constrained multi-scale object injection. Extensive experiments and ablation studies on several publicly available RS datasets demonstrate the effectiveness of the proposed MSCDDF. Our source code is publicly available at https://github.com/WangXin81/MSCDDF.
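The abstract describes MWAJG only at a high level. As one concrete illustration, the sketch below shows what layer-weighted cross-/self-attention fusion for mask generation could look like in PyTorch. This is not the authors' implementation (see the linked repository for that); the module name, tensor shapes, and the learned per-layer weighting are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse per-layer cross-attention maps
    (class tokens -> pixels) with self-attention maps (pixel -> pixel)
    via learned layer weights, yielding a soft class map usable as a
    segmentation-mask prior."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable logit per U-Net layer; softmax gives layer weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, cross_attn: torch.Tensor, self_attn: torch.Tensor) -> torch.Tensor:
        # cross_attn: (L, B, HW, C) - attention from C class tokens to HW pixels
        # self_attn:  (L, B, HW, HW) - row-normalized pixel-to-pixel affinities
        w = torch.softmax(self.layer_logits, dim=0)           # (L,)
        # Propagate class evidence along pixel affinities, per layer.
        refined = torch.einsum("lbqk,lbkc->lbqc", self_attn, cross_attn)
        # Weighted sum over layers, then normalize over classes.
        fused = torch.einsum("l,lbqc->bqc", w, refined)       # (B, HW, C)
        return F.softmax(fused, dim=-1)

# Toy usage on a 16x16 latent grid with 6 classes and 4 U-Net layers.
fusion = WeightedAttentionFusion(num_layers=4)
ca = torch.rand(4, 2, 16 * 16, 6)
sa = torch.softmax(torch.rand(4, 2, 16 * 16, 16 * 16), dim=-1)
mask_prior = fusion(ca, sa)  # (2, 256, 6)
```

The design choice sketched here, propagating text-to-image attention along pixel affinities before a weighted reduction over layers, is a common way to sharpen diffusion attention maps for dense prediction; the paper's actual fusion rule may differ.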