CFD-CLIP: Contrastive Feature Distillation with CLIP for Image Classification

Abstract

Contrastive vision–language models such as CLIP excel at few-shot learning but are often too large for practical deployment. To enable efficient use, we propose a CLIP-supervised distillation framework that transfers CLIP's multimodal knowledge into lightweight vision-only networks. Unlike conventional unimodal distillation, our method uses a dual-contrastive loss to align student visual features with CLIP's image–text embedding space, leveraging text embeddings as semantic anchors to preserve class-level feature structure. Experiments on CIFAR-100 and ImageNet show that our approach improves MobileNet accuracy by 4.83% and outperforms existing distillation baselines, providing a compact yet semantically aligned model for efficient deployment. Code is available at https://github.com/pandeng-001/CFD-CLIP.
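
To make the dual-contrastive objective concrete, below is a minimal PyTorch sketch of one plausible instantiation: the student's projected features are contrasted against frozen CLIP image embeddings (instance-level alignment) and against frozen CLIP text embeddings of the class names (semantic anchors). The function name, temperature, equal loss weighting, and tensor shapes are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def dual_contrastive_distill_loss(student_feats, clip_img_feats, clip_text_feats,
                                  labels, tau=0.07):
    """Hypothetical sketch of a dual-contrastive distillation loss.

    student_feats:   (B, D) student features projected into CLIP's embedding dim
    clip_img_feats:  (B, D) frozen CLIP image embeddings for the same batch
    clip_text_feats: (C, D) frozen CLIP text embeddings, one per class (semantic anchors)
    labels:          (B,)   ground-truth class indices
    """
    s = F.normalize(student_feats, dim=-1)
    ti = F.normalize(clip_img_feats, dim=-1)
    tt = F.normalize(clip_text_feats, dim=-1)

    # Image-image contrast: each student feature should match the CLIP image
    # embedding of its own sample against the rest of the batch.
    logits_ii = s @ ti.t() / tau                              # (B, B)
    targets_ii = torch.arange(s.size(0), device=s.device)
    loss_img = F.cross_entropy(logits_ii, targets_ii)

    # Image-text contrast: each student feature should match the text embedding
    # of its ground-truth class against all other class prompts.
    logits_it = s @ tt.t() / tau                              # (B, C)
    loss_txt = F.cross_entropy(logits_it, labels)

    # Equal weighting is an assumption for this sketch.
    return loss_img + loss_txt
```

In practice the student would add a small linear projection head so its feature dimension matches CLIP's, and the CLIP encoders would stay frozen during distillation; both choices are standard in CLIP-based distillation but are assumptions here.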