DSTAdapter: Divided Spatial-Temporal Adapter Fine-tuning Method for Sign Language Recognition
Abstract
Full fine-tuning, the commonly adopted approach for video-based sign language recognition models, faces two critical limitations: high computational resource consumption and compromised generalization. To address these challenges, we propose DSTAdapter, a parameter-efficient transfer learning framework that activates frozen CLIP models for video understanding through spatial-temporal decoupled adaptation. Our method introduces three key technical contributions: (1) a dual-branch adapter architecture with separate branches dedicated to capturing spatial hand shapes and temporal gesture dynamics; (2) channel-aware feature fusion modules that dynamically optimize the interaction between adapter-enhanced features and backbone representations; and (3) a lightweight design that enables efficient deployment on resource-constrained devices. Tuning only 4% of the model's parameters, the proposed method establishes new state-of-the-art performance across four benchmark sign language datasets. Comprehensive evaluations demonstrate significant efficiency gains, particularly on the Bukva benchmark, where DSTAdapter achieves a 30% reduction in training time and a 60% decrease in GPU memory consumption compared with conventional full fine-tuning. The compact architecture further facilitates practical multitask deployment. These advances offer a promising path toward real-world assistive technologies, particularly improved accessibility for hearing-impaired communities. The code for this work is available at https://github.com/BLOOM0-0/DSTAdapter.
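To make the spatial-temporal decoupled adaptation concrete, the sketch below shows one way such a dual-branch adapter could be wired around frozen backbone features. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the class name DividedAdapter, the bottleneck width, the depthwise temporal convolution, and the pooled sigmoid gate standing in for channel-aware fusion are all illustrative choices.

```python
import torch
import torch.nn as nn


class DividedAdapter(nn.Module):
    """Illustrative dual-branch adapter (not the official DSTAdapter code).

    A spatial bottleneck is applied per frame, a temporal bottleneck mixes
    information across frames, and a channel-aware gate blends the adapter
    signal back into the frozen backbone features.
    Input shape: (batch, frames, tokens, dim).
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        # Spatial branch: classic down-project / non-linearity / up-project,
        # applied independently to every frame's tokens.
        self.spatial = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )
        # Temporal branch: depthwise 1D convolution over the frame axis,
        # inside a bottleneck, so each token attends to its neighbors in time.
        self.temp_down = nn.Linear(dim, bottleneck)
        self.temp_conv = nn.Conv1d(
            bottleneck, bottleneck, kernel_size=3, padding=1, groups=bottleneck
        )
        self.temp_up = nn.Linear(bottleneck, dim)
        # Channel-aware fusion: a per-channel gate conditioned on globally
        # pooled features decides how much adapter signal to inject.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        # Spatial adaptation of each frame's token features.
        s = self.spatial(x)
        # Temporal adaptation: fold tokens into the batch, convolve over T.
        h = self.temp_down(x)                              # (B, T, N, C)
        h = h.permute(0, 2, 3, 1).reshape(b * n, -1, t)    # (B*N, C, T)
        h = self.temp_conv(h)
        h = h.reshape(b, n, -1, t).permute(0, 3, 1, 2)     # (B, T, N, C)
        r = self.temp_up(h)
        # Gate the summed adapter output, channel by channel, and add it
        # residually to the frozen backbone representation.
        g = self.gate(x.mean(dim=(1, 2), keepdim=True))    # (B, 1, 1, D)
        return x + g * (s + r)


# Usage sketch: wrap each frozen CLIP block's output, training only the
# adapter parameters. dim=768 matches ViT-B/16; the batch, frame, and token
# counts here are arbitrary.
adapter = DividedAdapter(dim=768)
feats = torch.randn(2, 4, 50, 768)   # (batch, frames, tokens, dim)
out = adapter(feats)                 # same shape, adapter-enhanced
```

In a full model, only modules like this would receive gradients while the CLIP backbone stays frozen, which is what keeps the tunable-parameter budget at the few-percent level reported in the abstract.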