DSTAdapter: Divided Spatial-Temporal Adapter Fine-tuning Method for Sign Language Recognition
Abstract
The commonly adopted practice of fully fine-tuning video-based sign language recognition models suffers from two critical limitations: high computational resource consumption and compromised generalization. To address these challenges, we propose DSTAdapter, a parameter-efficient transfer learning framework that activates frozen CLIP models for video understanding through spatial-temporal decoupled adaptation. Our methodology introduces three key technical contributions: (1) a dual-branch adapter architecture with separate branches dedicated to capturing spatial hand shapes and temporal gesture dynamics, (2) channel-aware feature fusion modules that dynamically optimize the interaction between adapter-enhanced features and backbone representations, and (3) a lightweight framework design enabling efficient deployment on resource-constrained devices. Requiring only 4% of the parameters to be tuned, the proposed method establishes new state-of-the-art performance across four benchmark sign language datasets. Comprehensive evaluations demonstrate significant efficiency gains, particularly on the Bukva benchmark, where DSTAdapter achieves a 30% reduction in training time and a 60% decrease in GPU memory consumption compared with conventional full fine-tuning. The compact architecture further facilitates practical multitask deployment. These advances offer a promising path toward real-world assistive technologies, particularly improved accessibility for hearing-impaired communities. The code for this work is available at https://github.com/BLOOM0-0/DSTAdapter.
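To make the abstract's architectural ideas concrete, the following is a minimal, hypothetical PyTorch sketch of a spatial-temporal decoupled adapter block, not the authors' actual implementation (see the repository for that). It assumes features from a frozen CLIP layer shaped `(batch, frames, tokens, dim)`; the class name `DSTAdapterBlock`, the bottleneck width, and the per-channel gate used here as a stand-in for the channel-aware fusion module are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DSTAdapterBlock(nn.Module):
    """Illustrative dual-branch adapter: a spatial branch adapts per-frame
    token features, a temporal branch mixes information across frames, and a
    per-channel gate (a stand-in for channel-aware fusion) blends the adapter
    output back into the frozen backbone's representation via a residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        # Spatial branch: bottleneck MLP applied to each token independently.
        self.spatial = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )
        # Temporal branch: depthwise 1D convolution over the frame axis,
        # wrapped in a down/up projection to keep the parameter count small.
        self.t_down = nn.Linear(dim, bottleneck)
        self.t_conv = nn.Conv1d(
            bottleneck, bottleneck, kernel_size=3, padding=1, groups=bottleneck
        )
        self.t_up = nn.Linear(bottleneck, dim)
        # Learnable per-channel fusion coefficients (assumed form of the
        # "channel-aware" fusion described in the abstract).
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) from a frozen CLIP layer.
        b, t, n, d = x.shape
        s = self.spatial(x)  # spatial hand-shape adaptation, per frame

        # Temporal branch: convolve each token's feature sequence over time.
        xt = self.t_down(x)                                  # (b, t, n, bk)
        xt = xt.permute(0, 2, 3, 1).reshape(b * n, -1, t)    # (b*n, bk, t)
        xt = self.t_conv(xt)                                 # mix across frames
        xt = xt.reshape(b, n, -1, t).permute(0, 3, 1, 2)     # (b, t, n, bk)
        tm = self.t_up(F.gelu(xt))                           # (b, t, n, d)

        g = torch.sigmoid(self.gate)  # channel-wise fusion weights in (0, 1)
        return x + g * (s + tm)       # residual fusion with backbone features
```

Because only these small blocks are trained while the CLIP backbone stays frozen, the tunable-parameter count stays a small fraction of the backbone's, which is the mechanism behind the abstract's reported savings in training time and GPU memory.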