CLIP-Driven Dynamic Feature Selection and Alignment Network for Referring Remote Sensing Image Segmentation
Abstract
Referring Remote Sensing Image Segmentation (RRSIS) aims to accurately locate and segment target objects in high-resolution aerial imagery based on natural language descriptions. Most existing approaches either directly adapt Referring Image Segmentation (RIS) frameworks originally designed for natural images or employ image-based foundation models such as SAM to improve segmentation accuracy. However, current RRSIS models still face substantial challenges due to the domain gap between remote sensing and natural images, including large variations in object scale, arbitrary object rotations, and complex spatial–linguistic relationships. Consequently, such transfers often lead to weak cross-modal interaction, inaccurate semantic alignment, and reduced localization precision, particularly for small or rotated objects. In addition, approaches that rely on multi-stage alignment pipelines, redundant high-level feature fusion, or the incorporation of large foundation models generally incur substantial computational overhead and training inefficiency, especially when handling complex referring expressions in high-resolution remote sensing imagery. To address these challenges, we propose CD2FSAN, a CLIP-driven dynamic feature selection and alignment network that establishes a unified framework for fine-grained cross-modal understanding in remote sensing imagery. The network first follows the principle of maximizing cross-modal information to dynamically select, from CLIP's hierarchical features, the visual representations most semantically aligned with the language, thereby strengthening cross-modal correspondence under image domain shifts. It then performs adaptive multi-scale aggregation and alignment to integrate linguistic cues into spatially diverse visual contexts, enabling precise feature fusion across varying object scales. Finally, a dynamic rotation correction decoder with a differentiable affine transformation refines the segmentation by compensating for orientation diversity and geometric distortions. Extensive experiments verify that CD2FSAN consistently outperforms existing methods in segmentation accuracy, validating the effectiveness of its core components while maintaining competitive computational efficiency. These results demonstrate the framework's strong capability to bridge the cross-modal gap between language and remote sensing imagery, highlighting its potential for advancing semantic understanding in vision–language remote sensing tasks.
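To make the language-guided selection over CLIP's hierarchical features concrete, the sketch below (PyTorch) shows one simplified way such a module could be realized: each stage's visual feature is scored against the sentence embedding and the stages are softmax-weighted accordingly. All names here are hypothetical, and the cosine-similarity weighting is a stand-in for the paper's information-maximization criterion rather than its actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedLayerSelection(nn.Module):
    """Illustrative sketch (hypothetical names): weight hierarchical visual
    features by their similarity to the sentence embedding, so the stages most
    aligned with the referring expression dominate the fused representation.
    Assumes all stage features have been projected to a common channel width
    and resized to a common spatial resolution beforehand."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        # Project the text embedding into the visual feature space for comparison.
        self.txt_proj = nn.Linear(txt_dim, vis_dim)

    def forward(self, vis_feats: list[torch.Tensor], txt_emb: torch.Tensor) -> torch.Tensor:
        # vis_feats: list of L tensors, each (N, vis_dim, H, W), one per CLIP stage
        # txt_emb:   sentence embedding of shape (N, txt_dim)
        t = F.normalize(self.txt_proj(txt_emb), dim=-1)                 # (N, vis_dim)
        scores = []
        for f in vis_feats:
            v = F.normalize(f.mean(dim=(2, 3)), dim=-1)                 # global descriptor (N, vis_dim)
            scores.append((v * t).sum(dim=-1))                          # cosine similarity (N,)
        weights = torch.softmax(torch.stack(scores, dim=-1), dim=-1)    # (N, L)
        stacked = torch.stack(vis_feats, dim=1)                         # (N, L, vis_dim, H, W)
        return (weights[:, :, None, None, None] * stacked).sum(dim=1)   # (N, vis_dim, H, W)
```

Because the weights are produced by a softmax, the selection remains differentiable and can adapt per expression, which is the property a dynamic, language-driven selection step needs.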
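The rotation-correction idea rests on differentiable affine warping of feature maps. The minimal sketch below (PyTorch) shows the general mechanism: a predicted rotation angle is converted into an affine sampling grid and applied with bilinear sampling, so gradients flow through the warp. The module and head names are hypothetical; the abstract does not specify the decoder's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationCorrection(nn.Module):
    """Illustrative sketch (hypothetical names): regress a rotation angle from
    a feature map and resample the map with a differentiable affine warp."""

    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical angle-regression head: global pooling + linear layer.
        self.angle_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),
            nn.Tanh(),  # bounds the output so the scaled angle lies in (-pi, pi)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, C, H, W)
        theta = self.angle_head(feats).squeeze(-1) * torch.pi   # predicted angle in radians, (N,)
        cos, sin = torch.cos(theta), torch.sin(theta)
        zeros = torch.zeros_like(cos)
        # Batch of 2x3 rotation matrices (pure rotation, no translation or scaling).
        affine = torch.stack(
            [torch.stack([cos, -sin, zeros], dim=-1),
             torch.stack([sin,  cos, zeros], dim=-1)], dim=1)    # (N, 2, 3)
        grid = F.affine_grid(affine, feats.shape, align_corners=False)
        return F.grid_sample(feats, grid, align_corners=False)

# Usage: warped = RotationCorrection(256)(torch.randn(2, 256, 64, 64))
```

Because both affine_grid and grid_sample are differentiable, the angle head can be trained end to end from the segmentation loss, which is what allows a decoder to compensate for arbitrary object orientations without explicit rotation labels.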