Hierarchical Prompt Composition for Memory-Efficient Open-World Continual Learning in Vision-Language Foundation Models
Abstract
Foundation models pre-trained on web-scale data, such as CLIP, exhibit strong zero-shot visual recognition capabilities. However, their deployment in open-world scenarios is constrained by catastrophic forgetting and an inability to incorporate novel concepts efficiently without full retraining. This paper introduces the Hierarchical Prompt Composition Network (HPC-Net), a memory-efficient architecture that enables vision-language models to learn incrementally in open environments. HPC-Net maintains a dynamically evolving hierarchy of learnable prompt components that are composed to form task-specific representations while preserving the model's foundational zero-shot capabilities. The architecture exploits the hierarchical compositionality of visual concepts through a three-tier prompt decomposition: (1) foundational prompts encoding broad semantic primitives, (2) compositional prompts capturing mid-level visual patterns, and (3) instance prompts encoding category-specific features. A Semantic Prototype Anchoring mechanism is introduced to prevent semantic drift in the shared prompt space, and a Contrastive Prompt Routing module dynamically selects and combines prompts for each input. Extensive experiments across four open-world benchmarks (Split-CIFAR100, Split-ImageNet-R, CORe50, and a new medical imaging benchmark, MedStream-7k) demonstrate that HPC-Net achieves an average accuracy of $84.3 \pm 0.9\%$, a $5.4\%$ absolute improvement over the strongest baseline. This is accomplished while retaining $98.4\%$ of the base model's zero-shot performance on seen domains and requiring only $2.1$M additional parameters ($11.6\times$ fewer than adapter-based fusion methods). All code, datasets, and pre-trained models will be released to facilitate reproducibility.
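To make the three-tier composition and routing concrete, the following is a minimal PyTorch sketch, not the authors' released implementation. It assumes a frozen CLIP-style image encoder producing d-dimensional features and illustrates one plausible reading of the abstract: each tier holds a pool of learnable prompts with matching keys, a cosine-similarity router selects and weights the top-k prompts per tier, and the composed prompts from the three tiers are concatenated. All class and method names (PromptTier, HierarchicalPromptComposer, route) and the pool sizes are illustrative assumptions.

    # Hypothetical sketch of hierarchical prompt composition with similarity-based routing.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PromptTier(nn.Module):
        """A pool of learnable prompt vectors with matching keys used for routing."""
        def __init__(self, pool_size: int, prompt_len: int, dim: int):
            super().__init__()
            self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim) * 0.02)
            self.keys = nn.Parameter(torch.randn(pool_size, dim) * 0.02)

        def route(self, query: torch.Tensor, top_k: int) -> torch.Tensor:
            # Cosine similarity between the image query and each prompt key.
            sim = F.normalize(query, dim=-1) @ F.normalize(self.keys, dim=-1).t()  # (B, pool)
            weights, idx = sim.topk(top_k, dim=-1)                                  # (B, k)
            weights = weights.softmax(dim=-1)
            selected = self.prompts[idx]                                            # (B, k, L, d)
            # Weighted combination of the top-k prompts -> one composed prompt per input.
            return (weights[..., None, None] * selected).sum(dim=1)                 # (B, L, d)

    class HierarchicalPromptComposer(nn.Module):
        """Composes foundational, compositional, and instance prompts for each input."""
        def __init__(self, dim: int = 512, prompt_len: int = 4):
            super().__init__()
            self.foundational = PromptTier(pool_size=10, prompt_len=prompt_len, dim=dim)
            self.compositional = PromptTier(pool_size=20, prompt_len=prompt_len, dim=dim)
            self.instance = PromptTier(pool_size=40, prompt_len=prompt_len, dim=dim)

        def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
            # Concatenate one composed prompt per tier along the token dimension.
            parts = [
                self.foundational.route(image_feat, top_k=1),
                self.compositional.route(image_feat, top_k=2),
                self.instance.route(image_feat, top_k=4),
            ]
            return torch.cat(parts, dim=1)  # (B, 3 * prompt_len, dim)

    if __name__ == "__main__":
        composer = HierarchicalPromptComposer(dim=512)
        feats = torch.randn(8, 512)   # stand-in for frozen CLIP image features
        print(composer(feats).shape)  # torch.Size([8, 12, 512])

In this sketch the frozen backbone stays untouched and only the prompt pools and keys are trained, which is consistent with the small parameter budget reported in the abstract; the abstract does not specify how the Semantic Prototype Anchoring loss constrains the shared prompt space, so that component is omitted here.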