Zero-shot adapter framework for cross-modal classification of remote sensing imagery
Abstract
Vision-language foundation models exhibit significant potential for open-world remote sensing applications, yet they face considerable challenges, including the limitations of generic prompts, the scarcity of annotated datasets, and insufficient feature extraction. To address these challenges, we introduce a novel zero-shot adapter framework for cross-modal classification of remote sensing imagery. The framework integrates three essential components. (1) An LLM-Augmented Prompt Generalization component, designed specifically for remote sensing classes, enriches the semantic depth of textual prompts with domain-specific knowledge from large language models (LLMs), enabling more contextually rich interpretations for remote sensing classification. (2) A Proxy-Enhanced Support Set Construction mechanism generates pseudo-labeled support sets, addressing the critical shortage of annotated data and providing a robust means of knowledge expansion. (3) A Multi-Granularity Feature Cache stores both local (patch-level) and global (scene-level) features and combines feature caching with zero-shot CLIP predictions, bridging the semantic gap between the image and text domains in remote sensing. These components act synergistically: the LLM-augmented prompts and proxy support sets strengthen semantic grounding, while feature caching and proxy learning jointly compensate for insufficient feature extraction. The proposed framework is particularly effective in resource-constrained environments. Experiments on five benchmark datasets demonstrate promising zero-shot and few-shot prediction performance, improving over existing cross-modal methods.
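To illustrate the general idea of combining a feature cache with zero-shot CLIP predictions, the sketch below shows a Tip-Adapter-style cache-augmented classification step. This is a minimal illustration under assumed names (`cache_keys`, `cache_values`, `alpha`, `beta`), not the authors' implementation, and it omits the multi-granularity and proxy-learning details described in the abstract.

```python
# Minimal sketch: blending zero-shot CLIP logits with logits from a cached
# support set (Tip-Adapter-style). All variable names are illustrative.
import torch

def cache_augmented_logits(image_feat, text_feats, cache_keys, cache_values,
                           alpha=1.0, beta=5.0):
    """Combine zero-shot CLIP predictions with cache-based predictions.

    image_feat:   (d,)   L2-normalized query image feature
    text_feats:   (C, d) L2-normalized class text embeddings (e.g. from LLM-augmented prompts)
    cache_keys:   (N, d) L2-normalized cached support features (e.g. patch- or scene-level)
    cache_values: (N, C) one-hot pseudo-labels of the cached support samples
    """
    # Zero-shot CLIP logits: scaled cosine similarity between image and text embeddings
    clip_logits = 100.0 * image_feat @ text_feats.t()                      # (C,)

    # Cache logits: affinity between the query and cached keys, mapped to class scores
    affinity = image_feat @ cache_keys.t()                                 # (N,)
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values      # (C,)

    # Residual combination of the two prediction sources
    return clip_logits + alpha * cache_logits
```

In this formulation, `alpha` balances the contribution of the cached support set against the zero-shot prediction, and `beta` sharpens the affinity between the query and the cached features; a multi-granularity variant would maintain separate caches for patch-level and scene-level features and fuse their logits.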