Hierarchical Prompt Composition for Memory-Efficient Open-World Continual Learning in Vision-Language Foundation Models


Abstract

Foundation models pre-trained on web-scale data, such as CLIP, exhibit strong zero-shot visual recognition capabilities. However, their deployment in open-world scenarios is constrained by catastrophic forgetting and an inability to efficiently incorporate novel concepts without full retraining. This paper introduces the Hierarchical Prompt Composition Network (HPC-Net), a memory-efficient architecture that enables vision-language models to learn incrementally in open environments. HPC-Net maintains a dynamically evolving hierarchy of learnable prompt components that are composed to form task-specific representations while preserving the model's foundational zero-shot capabilities. The architecture exploits the hierarchical compositionality of visual concepts through a three-tier prompt decomposition: (1) foundational prompts encoding broad semantic primitives, (2) compositional prompts for mid-level visual patterns, and (3) instance prompts for category-specific features. A Semantic Prototype Anchoring mechanism is introduced to prevent semantic drift in the shared prompt space, and a Contrastive Prompt Routing module dynamically selects and combines prompts for each input. Extensive experiments across four open-world benchmarks (Split-CIFAR100, Split-ImageNet-R, CORe50, and a new medical imaging benchmark, MedStream-7k) demonstrate that HPC-Net achieves an average accuracy of $84.3 \pm 0.9\%$, a $5.4\%$ absolute improvement over the strongest baseline. This is accomplished while retaining $98.4\%$ of the base model's zero-shot performance on seen domains and requiring only $2.1$M additional parameters (11.6x fewer than adapter-based fusion methods). All code, datasets, and pre-trained models will be released to facilitate reproducibility.
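
To make the three-tier prompt decomposition and input-dependent routing described above concrete, the following is a minimal, illustrative sketch in PyTorch. All names, dimensions, pool sizes, and the cosine-similarity top-k routing rule are assumptions introduced here for illustration; the paper's actual HPC-Net implementation (including Semantic Prototype Anchoring and the Contrastive Prompt Routing objective) may differ substantially.

```python
# Illustrative sketch only: three learnable prompt pools (foundational,
# compositional, instance) whose entries are selected per input by cosine
# similarity between a frozen image feature and learned routing keys.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalPromptComposer(nn.Module):
    """Compose prompt tokens from three tiers of learnable prompt pools."""

    def __init__(self, embed_dim=512, pool_sizes=(4, 16, 64), top_k=(1, 2, 4)):
        super().__init__()
        self.top_k = top_k
        # One prompt pool and one routing-key matrix per tier (assumed sizes).
        self.pools = nn.ParameterList(
            nn.Parameter(torch.randn(n, embed_dim) * 0.02) for n in pool_sizes
        )
        self.keys = nn.ParameterList(
            nn.Parameter(torch.randn(n, embed_dim) * 0.02) for n in pool_sizes
        )

    def forward(self, query):
        """query: (batch, embed_dim) image feature from a frozen CLIP encoder.
        Returns (batch, sum(top_k), embed_dim) composed prompt tokens."""
        selected = []
        for pool, keys, k in zip(self.pools, self.keys, self.top_k):
            # Cosine similarity between the query and each routing key.
            sim = F.normalize(query, dim=-1) @ F.normalize(keys, dim=-1).t()
            idx = sim.topk(k, dim=-1).indices      # (batch, k)
            selected.append(pool[idx])             # (batch, k, embed_dim)
        # Concatenate tiers into one prompt sequence per input.
        return torch.cat(selected, dim=1)


if __name__ == "__main__":
    composer = HierarchicalPromptComposer()
    image_features = torch.randn(8, 512)   # stand-in for CLIP image features
    prompts = composer(image_features)
    print(prompts.shape)                    # torch.Size([8, 7, 512])
```

In a setup like this, the composed prompt tokens would typically be prepended to the frozen backbone's token sequence, so that only the small prompt pools and routing keys are trained, which is consistent with the parameter budget reported in the abstract.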
