Open-Vocabulary Semantic Segmentation for Remote Sensing Imagery via Dual-Stream Feature Extraction and Category-Adaptive Refinement


Abstract

This research addresses the challenge of Open-Vocabulary Remote Sensing Image Semantic Segmentation (OVRSISS), in which models must accurately segment arbitrary text-queried categories in remote sensing imagery without prior training on those specific classes. Traditional methods are hindered by fixed vocabularies and the inherent complexities of remote sensing data. We propose RS-ZeroSeg, a novel end-to-end model that combines general Vision-Language Model (VLM) capabilities with specialized remote sensing knowledge. Its key components are a Dual-Stream Feature Extractor (DSFE) for heterogeneous feature fusion, a Multi-Scale Contextual Alignment Module (MS-CAM) for multi-scale integration, and a Category-Adaptive Refinement Head (CARH) for text-driven segmentation. Trained on a comprehensive remote sensing dataset, RS-ZeroSeg consistently outperforms state-of-the-art OVRSISS methods across diverse benchmarks, including FLAIR, FAST, ISPRS Potsdam, and FloodNet, setting a new state-of-the-art average mIoU with a substantial margin over previous bests. Extensive ablation studies validate the contribution of each proposed module, and further analyses confirm strong generalization to novel categories, improved computational efficiency, and effective training strategies, demonstrating RS-ZeroSeg's robustness and adaptability for practical remote sensing applications.
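The abstract does not detail the internals of RS-ZeroSeg, but the core open-vocabulary mechanism it builds on (CLIP-style matching of per-pixel embeddings against text-query embeddings, with two visual streams fused first) can be illustrated with a toy NumPy sketch. All function names, the additive fusion, and the cosine-similarity scoring below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize embeddings to unit length so dot products become cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def dual_stream_fuse(vlm_feats, rs_feats):
    # Toy stand-in for a dual-stream feature extractor: fuse a general
    # VLM stream with a remote-sensing-specific stream (here, by simple
    # addition; the real DSFE fusion is not specified in the abstract).
    return l2norm(vlm_feats + rs_feats)

def open_vocab_segment(fused_feats, text_embeds):
    # fused_feats: (H, W, D) per-pixel embeddings.
    # text_embeds: (C, D) embeddings of arbitrary text-queried categories.
    # Each pixel is assigned the category whose text embedding it is
    # most similar to -- the essence of open-vocabulary segmentation.
    sims = np.einsum("hwd,cd->hwc", l2norm(fused_feats), l2norm(text_embeds))
    return sims.argmax(axis=-1)  # (H, W) class-index map

rng = np.random.default_rng(0)
vlm = rng.standard_normal((4, 4, 8))   # general VLM stream
rs = rng.standard_normal((4, 4, 8))    # remote-sensing stream
queries = rng.standard_normal((3, 8))  # e.g. "building", "water", "road"
mask = open_vocab_segment(dual_stream_fuse(vlm, rs), queries)
```

Because the category set is just a list of text embeddings, new classes can be queried at inference time without retraining, which is the property the OVRSISS benchmarks above evaluate.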
