Entropy-Regularized Joint CTC–Attention Learning for Low-Resource Continuous Sign Language Recognition

Abstract

Continuous Sign Language Recognition (CSLR) seeks to transcribe unsegmented sign language videos into gloss sequences without frame-level supervision, presenting persistent challenges in temporal alignment, long-range dependency modeling, and reliable sequence-level generalization. While recent advances have achieved strong performance in high-resource languages, Kurdish Sign Language (KrdSL) remains largely unexplored due to the absence of sentence-level benchmarks. To address this gap, we introduce KrdSL-1400, the first continuous Kurdish Sign Language dataset, comprising 1,400 annotated video sequences covering 40 linguistically structured sentences performed by seven native signers, providing a standardized benchmark for low-resource CSLR. We propose a hybrid spatio-temporal CSLR framework that combines deep convolutional visual encoding with sequence-aware temporal modeling and a multi-task joint CTC–attention decoding strategy explicitly designed to address alignment uncertainty. The CTC objective enforces monotonic alignment, while an entropy-regularized multi-head attention mechanism dynamically emphasizes linguistically salient temporal segments, enabling robust sequence prediction without reliance on pose estimation or handcrafted features. Training dynamics exhibit stable and consistent convergence, with closely aligned training and validation WER curves indicating strong generalization. Quantitative evaluation shows that a CTC baseline achieves a WER of 13.5%, which is reduced to 10.5% using single-head attention, while the proposed model attains the best performance with a WER of 9.5%, corresponding to an approximate 30% relative improvement. Cross-dataset evaluation on the large-scale PHOENIX-2014-T benchmark further demonstrates generalization, achieving a WER of 13.7% and outperforming recent attention-based and transformer-based CSLR approaches.
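The joint objective described above combines a monotonic CTC term with an attention-decoder term whose attention weights are regularized by an entropy penalty. A minimal PyTorch sketch of such a loss is given below; the class name `JointCTCAttentionLoss`, the tensor shapes, the 0.5/0.5 loss interpolation, and the entropy weight of 0.01 are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCTCAttentionLoss(nn.Module):
    """Weighted combination of a CTC loss, an attention-decoder
    cross-entropy loss, and an entropy penalty on the attention weights.
    Names, argument layout, and default weights are illustrative
    assumptions rather than the paper's exact formulation."""

    def __init__(self, blank_id=0, ctc_weight=0.5, entropy_weight=0.01):
        super().__init__()
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ctc_weight = ctc_weight
        self.entropy_weight = entropy_weight

    def forward(self, ctc_log_probs, input_lengths, ctc_targets, target_lengths,
                dec_logits, dec_targets, attn_weights):
        # CTC branch: enforces a monotonic frame-to-gloss alignment.
        # ctc_log_probs: (T, B, V) log-softmax outputs of the CTC head;
        # ctc_targets: (B, S) padded gloss indices.
        loss_ctc = self.ctc_loss(ctc_log_probs, ctc_targets,
                                 input_lengths, target_lengths)

        # Attention branch: cross-entropy over the decoder's gloss predictions.
        # dec_logits: (B, L, V); dec_targets: (B, L), padding marked with -100.
        loss_att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets,
                                   ignore_index=-100)

        # Entropy regularizer on the multi-head attention distributions.
        # attn_weights: (B, H, L, T), each row summing to 1 over the T frames.
        # Penalizing high entropy pushes attention toward a few salient frames;
        # the sign and magnitude of this term are assumptions for illustration.
        entropy = -(attn_weights * (attn_weights + 1e-8).log()).sum(dim=-1).mean()

        return (self.ctc_weight * loss_ctc
                + (1.0 - self.ctc_weight) * loss_att
                + self.entropy_weight * entropy)
```

In this sketch the entropy term is added with a positive weight, which sharpens the attention distribution over frames; a negative weight would instead encourage smoother attention. The paper's exact weighting and sign should be taken from the full text.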
