Two-Stage Fine-tuning CLIP by Introducing Structure Knowledge for Few-shot Classification

Abstract

Abstract concepts containing structural information, such as tangrams, are often used in cognitive psychology to explore spatial reasoning and visual cognition. Inspired by this, we propose a simple yet effective fine-tuning method called two-stage fine-tuning of contrastive language-image pretraining (TSF-CLIP). In stage I, the CLIP encoders are fine-tuned on an image-text matching task built on a tangram dataset, so that structural prior knowledge is captured by both encoders during fine-tuning. In stage II, to further improve accuracy, a linear head is aligned to the domain of the specific downstream task. The proposed TSF-CLIP method not only dynamically integrates structural prior knowledge and semantic information, but also avoids the shortcomings of large models, namely poor spatial and logical reasoning ability and an excessive number of fine-tuning parameters, and enhances the adaptability of the model to different downstream tasks. Experimental results demonstrate that TSF-CLIP substantially boosts the performance of the target model and outperforms existing few-shot image classification approaches. Compared to the original CLIP, the average accuracy of TSF-CLIP is improved by 16.1% across 10 image recognition datasets. The code and related datasets can be found at https://github.com/Patrickeroo/TSF-CLIP.
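
The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' released implementation: it assumes a PyTorch-style CLIP model from the open_clip library, a hypothetical `tangram_loader` yielding (image, caption) pairs for stage I, and a hypothetical `fewshot_loader` of labeled downstream images for stage II.

```python
# Minimal sketch of the two-stage fine-tuning idea (assumptions noted in comments).
import torch
import torch.nn as nn
import torch.nn.functional as F
import open_clip  # assumed CLIP implementation; the paper's repository may differ

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device)

def clip_contrastive_loss(image_feats, text_feats, logit_scale):
    """Symmetric InfoNCE loss used for CLIP-style image-text matching."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# ---- Stage I: fine-tune both CLIP encoders on tangram image-text pairs ----
# `tangram_loader` is a hypothetical DataLoader yielding (images, captions)
# that describe the structural composition of each tangram.
def stage_one(tangram_loader, epochs=5, lr=1e-6):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, captions in tangram_loader:
            images = images.to(device)
            tokens = tokenizer(captions).to(device)
            img_f = model.encode_image(images)
            txt_f = model.encode_text(tokens)
            loss = clip_contrastive_loss(img_f, txt_f, model.logit_scale.exp())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# ---- Stage II: freeze the encoders, align a linear head to the downstream task ----
# `fewshot_loader` is a hypothetical DataLoader of (images, labels) for the target dataset.
def stage_two(fewshot_loader, num_classes, embed_dim=512, epochs=20, lr=1e-3):
    for p in model.parameters():
        p.requires_grad = False
    head = nn.Linear(embed_dim, num_classes).to(device)
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    model.eval()
    for _ in range(epochs):
        for images, labels in fewshot_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = F.normalize(model.encode_image(images), dim=-1)
            loss = F.cross_entropy(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```

In this reading, stage I updates both encoders so that structural (tangram) knowledge enters the shared embedding space, while stage II trains only a lightweight linear head on the frozen features, which keeps the number of task-specific fine-tuning parameters small.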
