Zero-shot adapter framework for cross-modal classification of remote sensing imagery
Abstract
Vision-language foundation models exhibit significant potential for open-world remote sensing applications, yet they face considerable challenges, including the limitations of generic prompts, the scarcity of annotated datasets, and insufficient feature extraction. To address these challenges, we introduce a novel zero-shot adapter framework for cross-modal classification of remote sensing imagery. The framework integrates three essential components. (1) An LLM-Augmented Prompt Generalization component, designed specifically for remote sensing classes, enriches the semantic depth of textual prompts with domain-specific knowledge from large language models (LLMs), enabling more contextually rich interpretations for remote sensing classification. (2) A Proxy-Enhanced Support Set Construction mechanism generates pseudo-labeled support sets, addressing the critical shortage of annotated data and providing a robust means of knowledge expansion. (3) A Multi-Granularity Feature Cache stores both local (patch-level) and global (scene-level) features and combines feature caching with zero-shot CLIP predictions, bridging the semantic gap between the image and text domains in remote sensing. These components act synergistically: the LLM-augmented prompts and proxy support sets strengthen semantic grounding, while feature caching and proxy learning jointly compensate for insufficient feature extraction. The proposed framework is particularly effective in resource-constrained environments. Experiments on five benchmark datasets demonstrate promising zero-shot and few-shot prediction performance, improving over existing cross-modal methods.
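To illustrate the general idea of combining a feature cache with zero-shot CLIP predictions, the sketch below shows a Tip-Adapter-style cache-augmented classification step. This is a minimal illustration under assumed names (`cache_keys`, `cache_values`, `alpha`, `beta`), not the authors' implementation, and it omits the multi-granularity and proxy-learning details described in the abstract.

```python
# Minimal sketch: blending zero-shot CLIP logits with logits from a cached
# support set (Tip-Adapter-style). All variable names are illustrative.
import torch

def cache_augmented_logits(image_feat, text_feats, cache_keys, cache_values,
                           alpha=1.0, beta=5.0):
    """Combine zero-shot CLIP predictions with cache-based predictions.

    image_feat:   (d,)   L2-normalized query image feature
    text_feats:   (C, d) L2-normalized class text embeddings (e.g. from LLM-augmented prompts)
    cache_keys:   (N, d) L2-normalized cached support features (e.g. patch- or scene-level)
    cache_values: (N, C) one-hot pseudo-labels of the cached support samples
    """
    # Zero-shot CLIP logits: scaled cosine similarity between image and text embeddings
    clip_logits = 100.0 * image_feat @ text_feats.t()                      # (C,)

    # Cache logits: affinity between the query and cached keys, mapped to class scores
    affinity = image_feat @ cache_keys.t()                                 # (N,)
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values      # (C,)

    # Residual combination of the two prediction sources
    return clip_logits + alpha * cache_logits
```

In this formulation, `alpha` balances the contribution of the cached support set against the zero-shot prediction, and `beta` sharpens the affinity between the query and the cached features; a multi-granularity variant would maintain separate caches for patch-level and scene-level features and fuse their logits.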