Entropy-Regularized Joint CTC–Attention Learning for Low-Resource Continuous Sign Language Recognition

Abstract

Continuous Sign Language Recognition (CSLR) seeks to transcribe unsegmented sign language videos into gloss sequences without frame-level supervision, presenting persistent challenges in temporal alignment, long-range dependency modeling, and reliable sequence-level generalization. While recent advances have achieved strong performance in high-resource languages, Kurdish Sign Language (KrdSL) remains largely unexplored due to the absence of sentence-level benchmarks. To address this gap, we introduce KrdSL-1400, the first continuous Kurdish Sign Language dataset, comprising 1,400 annotated video sequences covering 40 linguistically structured sentences performed by seven native signers, providing a standardized benchmark for low-resource CSLR. We propose a hybrid spatio-temporal CSLR framework that combines deep convolutional visual encoding with sequence-aware temporal modeling and a multi-task joint CTC–attention decoding strategy explicitly designed to address alignment uncertainty. The CTC objective enforces monotonic alignment, while an entropy-regularized multi-head attention mechanism dynamically emphasizes linguistically salient temporal segments, enabling robust sequence prediction without reliance on pose estimation or handcrafted features. Training dynamics exhibit stable and consistent convergence, with closely aligned training and validation WER curves indicating strong generalization. Quantitative evaluation shows that a CTC baseline achieves a WER of 13.5%, which is reduced to 10.5% using single-head attention, while the proposed model attains the best performance with a WER of 9.5%, corresponding to an approximate 30% relative improvement. Cross-dataset evaluation on the large-scale PHOENIX-2014-T benchmark further demonstrates generalization, achieving a WER of 13.7% and outperforming recent attention-based and transformer-based CSLR approaches.
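The joint objective described above combines a monotonic CTC term with an attention-decoder term whose attention weights are regularized by an entropy penalty. A minimal PyTorch sketch of such a loss is given below; the class name `JointCTCAttentionLoss`, the tensor shapes, the 0.5/0.5 loss interpolation, and the entropy weight of 0.01 are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCTCAttentionLoss(nn.Module):
    """Weighted combination of a CTC loss, an attention-decoder
    cross-entropy loss, and an entropy penalty on the attention weights.
    Names, argument layout, and default weights are illustrative
    assumptions rather than the paper's exact formulation."""

    def __init__(self, blank_id=0, ctc_weight=0.5, entropy_weight=0.01):
        super().__init__()
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ctc_weight = ctc_weight
        self.entropy_weight = entropy_weight

    def forward(self, ctc_log_probs, input_lengths, ctc_targets, target_lengths,
                dec_logits, dec_targets, attn_weights):
        # CTC branch: enforces a monotonic frame-to-gloss alignment.
        # ctc_log_probs: (T, B, V) log-softmax outputs of the CTC head;
        # ctc_targets: (B, S) padded gloss indices.
        loss_ctc = self.ctc_loss(ctc_log_probs, ctc_targets,
                                 input_lengths, target_lengths)

        # Attention branch: cross-entropy over the decoder's gloss predictions.
        # dec_logits: (B, L, V); dec_targets: (B, L), padding marked with -100.
        loss_att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets,
                                   ignore_index=-100)

        # Entropy regularizer on the multi-head attention distributions.
        # attn_weights: (B, H, L, T), each row summing to 1 over the T frames.
        # Penalizing high entropy pushes attention toward a few salient frames;
        # the sign and magnitude of this term are assumptions for illustration.
        entropy = -(attn_weights * (attn_weights + 1e-8).log()).sum(dim=-1).mean()

        return (self.ctc_weight * loss_ctc
                + (1.0 - self.ctc_weight) * loss_att
                + self.entropy_weight * entropy)
```

In this sketch the entropy term is added with a positive weight, which sharpens the attention distribution over frames; a negative weight would instead encourage smoother attention. The paper's exact weighting and sign should be taken from the full text.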
