Decoupled Text-Guided Distillation for Efficient Federated Learning on Edge Devices

Abstract

Federated Learning (FL) enables the collaborative training of models across heterogeneous edge devices while preserving data privacy; however, its performance degrades significantly under domain shift. While integrating Vision-Language Models (VLMs) can mitigate this, existing prompt-tuning methods typically remain coupled to the massive VLM backbone during inference, rendering them impractical for resource-constrained edge devices. To address this challenge, we propose CLIP-assisted Domain-Invariant Federated Learning (CDIFed), which decouples the VLM from the deployment model to enhance robustness without incurring high inference latency. This framework integrates a Text-Guided Domain Adapter, implemented as a parameter-efficient bottleneck module, which aligns visual features with invariant text-based anchors to filter domain-specific noise while maintaining class-discriminative semantics. CDIFed operates through a communication-efficient two-phase framework: clients first adapt a frozen CLIP teacher, and then the adapted teacher supervises the training of a lightweight student network via feature knowledge distillation. Unlike previous approaches, the heavy VLM is discarded after adaptation, and only the student model parameters are transmitted to the server for aggregation. Experiments on the Digits and Office-Caltech benchmarks demonstrate that CDIFed significantly outperforms state-of-the-art methods in federated domain generalisation while maintaining the inference efficiency required for heterogeneous edge devices.
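The abstract's two core mechanisms, a bottleneck adapter that aligns visual features with fixed text anchors and a feature-level distillation loss from the adapted teacher to a lightweight student, can be sketched in a few lines. The following is a minimal illustrative sketch only, not the paper's implementation: the dimensions, weight names, and random stand-ins for CLIP visual/text features are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck_adapter(x, W_down, W_up):
    # Parameter-efficient adapter: down-project, ReLU, up-project,
    # with a residual connection back to the input features.
    h = np.maximum(x @ W_down, 0.0)
    return x + h @ W_up

def text_align_loss(visual_f, text_anchors):
    # Pull adapted visual features toward domain-invariant text anchors
    # via cosine similarity (1 - cos, averaged over the batch).
    v = visual_f / np.linalg.norm(visual_f, axis=-1, keepdims=True)
    t = text_anchors / np.linalg.norm(text_anchors, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(v * t, axis=-1)))

def feature_distill_loss(student_f, teacher_f):
    # Phase two: the student mimics the adapted teacher's features (MSE).
    return float(np.mean((student_f - teacher_f) ** 2))

d, r = 512, 64  # feature dim and bottleneck width (hypothetical choices)
W_down = rng.normal(scale=0.02, size=(d, r))
W_up = rng.normal(scale=0.02, size=(r, d))

visual = rng.normal(size=(8, d))        # stand-in for frozen CLIP visual features
text_anchor = rng.normal(size=(8, d))   # stand-in for per-class text embeddings
student = rng.normal(size=(8, d))       # stand-in for student network features

adapted = bottleneck_adapter(visual, W_down, W_up)
print(text_align_loss(adapted, text_anchor),
      feature_distill_loss(student, adapted))
```

After local adaptation, only the small student parameters would be sent to the server for aggregation; the frozen CLIP teacher and adapter never leave the device, which is what keeps inference light on the edge.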
