HISF: Hierarchical Interactive Semantic Fusion for Multi-Modal Prompt Learning

Abstract

Recent vision-language pre-training models, such as CLIP, have been shown to generalize well across a wide range of tasks and modalities. Nonetheless, their adaptability to specific downstream tasks remains limited. As a lightweight adaptation approach, prompt learning enables task transfer by optimizing only a small set of learnable vectors, offering a more flexible way to adapt pre-trained models. However, current methods concentrate mainly on the design of unimodal prompts and lack effective mechanisms for multimodal semantic fusion and label alignment, which limits their representational power. To tackle these problems, this paper proposes a Hierarchical Interactive Semantic Fusion (HISF) framework for multimodal prompt learning. On top of frozen CLIP backbones, HISF simultaneously injects visual and textual signals into intermediate Transformer layers through a cross-attention mechanism together with learnable category embeddings. This design achieves hierarchical semantic fusion at the modality level while preserving structural consistency at every layer. In addition, a Label Embedding Constraint and a Semantic Alignment Loss are introduced to promote category consistency and alleviate semantic drift during training. Extensive experiments on 11 few-shot image classification benchmarks show that HISF improves average accuracy by around 0.7% over state-of-the-art methods and exhibits strong robustness in cross-domain transfer tasks. Ablation studies further verify the effectiveness of each component and of their combination: the hierarchical structure, cross-modal attention, and semantic alignment jointly enrich the model's representational capacity. In conclusion, HISF offers a new hierarchical perspective on multimodal prompt learning and provides a lightweight, generalizable paradigm for adapting vision-language pre-trained models.
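
To make the cross-modal fusion idea concrete, the following PyTorch sketch illustrates one plausible form of the mechanism described in the abstract: learnable textual and visual prompt tokens exchange information via cross-attention before being injected into an intermediate layer of a frozen CLIP-style backbone, and a cosine-based alignment term encourages image-text consistency. This is an illustrative approximation based only on the abstract, not the authors' released implementation; names such as PromptFusionLayer, semantic_alignment_loss, and the chosen dimensions are assumptions.

    # Illustrative sketch only: class and function names are assumptions,
    # not identifiers from the HISF paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PromptFusionLayer(nn.Module):
        """Cross-attention between learnable textual and visual prompt tokens,
        intended to sit at an intermediate layer of a frozen CLIP-style model."""
        def __init__(self, embed_dim=512, num_heads=8, num_prompts=4):
            super().__init__()
            self.text_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
            self.visual_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

        def forward(self, batch_size):
            # Textual prompts attend to visual prompts and vice versa, so the two
            # modalities exchange information before being prepended to each branch.
            t = self.text_prompts.unsqueeze(0).expand(batch_size, -1, -1)
            v = self.visual_prompts.unsqueeze(0).expand(batch_size, -1, -1)
            t_fused, _ = self.cross_attn(query=t, key=v, value=v)
            v_fused, _ = self.cross_attn(query=v, key=t, value=t)
            return t_fused, v_fused

    def semantic_alignment_loss(image_feats, text_feats):
        """Cosine-similarity alignment between paired image features and
        class-text features; one possible form of the alignment objective."""
        image_feats = F.normalize(image_feats, dim=-1)
        text_feats = F.normalize(text_feats, dim=-1)
        return (1.0 - (image_feats * text_feats).sum(dim=-1)).mean()

In such a setup, the fused prompt tokens would be prepended to the token sequences of the frozen text and image encoders at each selected layer, so only the prompt parameters and the fusion module are trained; sharing one attention module for both directions is a simplification made here for brevity.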
