CLIP-Driven with Dynamic Feature Selection and Alignment Network for Referring Remote Sensing Image Segmentation
Abstract
Referring Remote Sensing Image Segmentation (RRSIS) aims to accurately locate and segment target objects in high-resolution aerial imagery based on natural language descriptions. Current RRSIS models face numerous challenges stemming from the significant differences between remote sensing images and natural images, including scale variation, object rotation, and the difficulty of matching complex linguistic queries with spatially variable targets. Existing methods often rely on high-level semantic features or multi-stage cross-modal alignment, resulting in long training times and inefficiency on complex queries. In this context, we propose the CLIP-Driven with Dynamic Feature Selection and Alignment Network (CD2FSAN), a novel framework that comprises information-theoretic feature selection, adaptive multi-scale aggregation and alignment, and a dynamic rotation correction decoder to better align remote sensing visual features with textual descriptions. Specifically, CD2FSAN dynamically selects the visual features that best match the language description by maximizing cross-modal information, alleviating the domain shift caused by CLIP's pretraining on natural images, and integrates language information during encoding. The framework also incorporates a multi-scale feature aggregation and alignment mechanism that ensures precise cross-modal alignment, particularly for small targets. Additionally, CD2FSAN introduces a dynamic rotation correction mechanism based on differentiable affine transformations, enabling the network to adaptively adjust object orientations and improve segmentation accuracy. Experiments on three standard datasets, RefSegRS, RRSIS-D, and RISBench, demonstrate CD2FSAN's superior performance in terms of oIoU, mIoU, and precision. Ablation studies and qualitative visualizations validate the efficacy of each module, confirming the framework's robustness in handling spatial variation, rotation, and cross-modal alignment, and significantly narrowing the cross-modal gap in CLIP-based single-stage RRSIS.
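To make the two mechanisms named above more concrete, the sketch below illustrates (a) a text-guided soft selection over multi-layer visual features and (b) a differentiable affine rotation-correction step. This is a minimal PyTorch sketch under our own assumptions about shapes, module names (`TextGuidedLayerSelector`, `RotationCorrection`), and the gating scheme; it is not the authors' implementation, and the paper itself should be consulted for the exact information-theoretic selection criterion and decoder design.

```python
# Illustrative sketch only: text-guided layer selection and differentiable
# rotation correction, loosely following the mechanisms named in the abstract.
# All module names, shapes, and the gating scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGuidedLayerSelector(nn.Module):
    """Weight CLIP visual layers by their similarity to the sentence embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # hypothetical projection into a shared space

    def forward(self, layer_feats, text_emb):
        # layer_feats: list of (B, C, H, W) maps from different encoder layers
        # text_emb:    (B, C) pooled sentence embedding
        t = F.normalize(self.proj(text_emb), dim=-1)                     # (B, C)
        pooled = [F.normalize(f.mean(dim=(2, 3)), dim=-1) for f in layer_feats]
        scores = torch.stack([(p * t).sum(-1) for p in pooled], dim=1)   # (B, L)
        weights = scores.softmax(dim=1)                                  # soft layer selection
        fused = sum(w[:, None, None, None] * f
                    for w, f in zip(weights.unbind(dim=1), layer_feats))
        return fused


class RotationCorrection(nn.Module):
    """Predict a per-image angle and resample features with a differentiable affine grid."""

    def __init__(self, dim: int):
        super().__init__()
        self.angle_head = nn.Linear(dim, 1)  # predicts a rotation angle in radians

    def forward(self, feat):
        # feat: (B, C, H, W)
        angle = self.angle_head(feat.mean(dim=(2, 3))).squeeze(-1)       # (B,)
        cos, sin = angle.cos(), angle.sin()
        zeros = torch.zeros_like(cos)
        theta = torch.stack(
            [torch.stack([cos, -sin, zeros], dim=-1),
             torch.stack([sin,  cos, zeros], dim=-1)], dim=1)            # (B, 2, 3)
        grid = F.affine_grid(theta, feat.size(), align_corners=False)
        return F.grid_sample(feat, grid, align_corners=False)            # rotated features


if __name__ == "__main__":
    feats = [torch.randn(2, 64, 32, 32) for _ in range(3)]
    text = torch.randn(2, 64)
    fused = TextGuidedLayerSelector(64)(feats, text)
    corrected = RotationCorrection(64)(fused)
    print(fused.shape, corrected.shape)  # both (2, 64, 32, 32)
```

Because both the layer weights and the affine resampling are differentiable, such a pipeline can in principle be trained end-to-end with the segmentation loss, which is what makes a "dynamic" selection and rotation correction feasible in a single-stage model.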