DSTAdapter: Divided Spatial-Temporal Adapter Fine-tuning Method for Sign Language Recognition
Abstract
Full fine-tuning, the commonly adopted approach for video-based sign language recognition models, faces two critical limitations: high computational resource consumption and compromised generalization. To address these challenges, we propose DSTAdapter, a parameter-efficient transfer learning framework that activates frozen CLIP models for video understanding through spatial-temporal decoupled adaptation. Our method introduces three key technical contributions: (1) a dual-branch adapter architecture with separate branches dedicated to capturing spatial hand shapes and temporal gesture dynamics; (2) channel-aware feature fusion modules that dynamically optimize the interaction between adapter-enhanced features and backbone representations; and (3) a lightweight design that enables efficient deployment on resource-constrained devices. Tuning only 4% of the model's parameters, the proposed method establishes new state-of-the-art performance across four benchmark sign language datasets. Comprehensive evaluations demonstrate significant efficiency gains, particularly on the Bukva benchmark, where DSTAdapter achieves a 30% reduction in training time and a 60% decrease in GPU memory consumption compared with conventional full fine-tuning. The compact architecture further facilitates practical multitask deployment. These advances offer a promising path toward real-world assistive technologies, particularly improved accessibility for hearing-impaired communities. The code for this work is available at https://github.com/BLOOM0-0/DSTAdapter.
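To make the spatial-temporal decoupled adaptation concrete, the sketch below shows one way such a dual-branch adapter could be wired around frozen backbone features. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the class name DividedAdapter, the bottleneck width, the depthwise temporal convolution, and the pooled sigmoid gate standing in for channel-aware fusion are all illustrative choices.

```python
import torch
import torch.nn as nn


class DividedAdapter(nn.Module):
    """Illustrative dual-branch adapter (not the official DSTAdapter code).

    A spatial bottleneck is applied per frame, a temporal bottleneck mixes
    information across frames, and a channel-aware gate blends the adapter
    signal back into the frozen backbone features.
    Input shape: (batch, frames, tokens, dim).
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        # Spatial branch: classic down-project / non-linearity / up-project,
        # applied independently to every frame's tokens.
        self.spatial = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )
        # Temporal branch: depthwise 1D convolution over the frame axis,
        # inside a bottleneck, so each token attends to its neighbors in time.
        self.temp_down = nn.Linear(dim, bottleneck)
        self.temp_conv = nn.Conv1d(
            bottleneck, bottleneck, kernel_size=3, padding=1, groups=bottleneck
        )
        self.temp_up = nn.Linear(bottleneck, dim)
        # Channel-aware fusion: a per-channel gate conditioned on globally
        # pooled features decides how much adapter signal to inject.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        # Spatial adaptation of each frame's token features.
        s = self.spatial(x)
        # Temporal adaptation: fold tokens into the batch, convolve over T.
        h = self.temp_down(x)                              # (B, T, N, C)
        h = h.permute(0, 2, 3, 1).reshape(b * n, -1, t)    # (B*N, C, T)
        h = self.temp_conv(h)
        h = h.reshape(b, n, -1, t).permute(0, 3, 1, 2)     # (B, T, N, C)
        r = self.temp_up(h)
        # Gate the summed adapter output, channel by channel, and add it
        # residually to the frozen backbone representation.
        g = self.gate(x.mean(dim=(1, 2), keepdim=True))    # (B, 1, 1, D)
        return x + g * (s + r)


# Usage sketch: wrap each frozen CLIP block's output, training only the
# adapter parameters. dim=768 matches ViT-B/16; the batch, frame, and token
# counts here are arbitrary.
adapter = DividedAdapter(dim=768)
feats = torch.randn(2, 4, 50, 768)   # (batch, frames, tokens, dim)
out = adapter(feats)                 # same shape, adapter-enhanced
```

In a full model, only modules like this would receive gradients while the CLIP backbone stays frozen, which is what keeps the tunable-parameter budget at the few-percent level reported in the abstract.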