CFD-CLIP: Contrastive Feature Distillation with CLIP for Image Classification

Abstract

Contrastive vision–language models such as CLIP excel at few-shot learning but are often too large for practical deployment. To enable efficient use, we propose a CLIP-supervised distillation framework that transfers CLIP's multimodal knowledge into lightweight vision-only networks. Unlike conventional unimodal distillation, our method uses a dual-contrastive loss to align student visual features with CLIP's image–text embedding space, leveraging text embeddings as semantic anchors to preserve class-level feature structure. Experiments on CIFAR-100 and ImageNet show that our approach improves MobileNet accuracy by 4.83% and outperforms existing distillation baselines, providing a compact yet semantically aligned model for efficient deployment. Code is available at https://github.com/pandeng-001/CFD-CLIP.
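
To make the dual-contrastive objective concrete, below is a minimal PyTorch sketch of one plausible instantiation: the student's projected features are contrasted against frozen CLIP image embeddings (instance-level alignment) and against frozen CLIP text embeddings of the class names (semantic anchors). The function name, temperature, equal loss weighting, and tensor shapes are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def dual_contrastive_distill_loss(student_feats, clip_img_feats, clip_text_feats,
                                  labels, tau=0.07):
    """Hypothetical sketch of a dual-contrastive distillation loss.

    student_feats:   (B, D) student features projected into CLIP's embedding dim
    clip_img_feats:  (B, D) frozen CLIP image embeddings for the same batch
    clip_text_feats: (C, D) frozen CLIP text embeddings, one per class (semantic anchors)
    labels:          (B,)   ground-truth class indices
    """
    s = F.normalize(student_feats, dim=-1)
    ti = F.normalize(clip_img_feats, dim=-1)
    tt = F.normalize(clip_text_feats, dim=-1)

    # Image-image contrast: each student feature should match the CLIP image
    # embedding of its own sample against the rest of the batch.
    logits_ii = s @ ti.t() / tau                              # (B, B)
    targets_ii = torch.arange(s.size(0), device=s.device)
    loss_img = F.cross_entropy(logits_ii, targets_ii)

    # Image-text contrast: each student feature should match the text embedding
    # of its ground-truth class against all other class prompts.
    logits_it = s @ tt.t() / tau                              # (B, C)
    loss_txt = F.cross_entropy(logits_it, labels)

    # Equal weighting is an assumption for this sketch.
    return loss_img + loss_txt
```

In practice the student would add a small linear projection head so its feature dimension matches CLIP's, and the CLIP encoders would stay frozen during distillation; both choices are standard in CLIP-based distillation but are assumptions here.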