Pose-based Contrastive Representation Learning for Sign Languages

Abstract

Sign language processing remains challenging due to the scarcity of large, well-annotated datasets and the strong linguistic specificity of sign languages across regions. Most existing sign language recognition and translation systems rely on gloss or text supervision, which limits their applicability to low-resource sign languages where such annotations are unavailable. In this work, we propose a pose-based contrastive representation learning framework that learns sign language representations purely from articulatory structure, without relying on text or gloss labels. Each sign video is represented as a sequence of pose landmarks extracted with MediaPipe and encoded by a Transformer-based temporal model trained with a supervised contrastive objective. The model is trained on the ASL Citizen dataset and evaluated in a zero-shot cross-lingual retrieval setting on the INCLUDE and LSA64 datasets, which represent Indian and Argentinian Sign Language, respectively. Experimental results demonstrate strong embedding discrimination in pairwise evaluation and high zero-shot retrieval performance, achieving Recall@1 of 97.12% on LSA64 and 88.6% on INCLUDE, with Recall@5 and Recall@10 exceeding 96% on both datasets. These results indicate that articulatory pose information alone is sufficient to learn robust and transferable sign representations across languages. The proposed approach offers a scalable alternative for sign language understanding in under-resourced settings and provides a foundation for cross-lingual retrieval and similarity-based sign learning applications.
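
To make the described pipeline concrete, the sketch below shows one plausible reading of the abstract's architecture: a Transformer encoder over per-frame pose landmark vectors, pooled into a unit-norm embedding and trained with a supervised contrastive (SupCon-style) loss. All specifics here are assumptions for illustration, not the authors' implementation: the landmark count, model dimensions, temperature, mean-pooling readout, and the PyTorch framing are chosen by the editor.

```python
# Illustrative sketch only (assumed hyperparameters, not the paper's exact configuration):
# a pose-sequence Transformer encoder with a supervised contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseTransformerEncoder(nn.Module):
    def __init__(self, num_landmarks=543, coords=3, d_model=256,
                 nhead=8, num_layers=4, embed_dim=128):
        super().__init__()
        # Flatten the per-frame landmarks (x, y, z) into one feature vector per frame.
        self.input_proj = nn.Linear(num_landmarks * coords, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Projection head producing the embedding used for contrastive training/retrieval.
        self.head = nn.Linear(d_model, embed_dim)

    def forward(self, poses):                     # poses: (B, T, num_landmarks * coords)
        h = self.encoder(self.input_proj(poses))  # (B, T, d_model)
        z = self.head(h.mean(dim=1))              # temporal mean pooling -> (B, embed_dim)
        return F.normalize(z, dim=-1)             # unit-norm embeddings for cosine retrieval

def supervised_contrastive_loss(z, labels, temperature=0.07):
    """SupCon-style loss: pull embeddings of the same sign label together, push others apart."""
    sim = z @ z.t() / temperature                                  # (B, B) scaled cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))                # exclude self-comparisons
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)     # log-softmax over non-self pairs
    mean_log_prob_pos = (log_prob.masked_fill(~pos, 0.0).sum(dim=1)
                         / pos.sum(dim=1).clamp(min=1))
    anchored = pos.any(dim=1)                                      # anchors with at least one positive
    return -mean_log_prob_pos[anchored].mean()
```

Under this reading, zero-shot cross-lingual retrieval then reduces to ranking gallery signs by cosine similarity between embeddings, from which Recall@1/5/10 can be computed.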
