Leveraging Large and Vision Language Models for Gloss-Free Sign Language Translation in Deaf Communication: A Survey


Abstract

Sign Language Translation (SLT) is an essential tool for overcoming communication barriers faced by the Deaf and Hard-of-Hearing (DHH) community, enabling equal access to vital information and interactions. This need is especially critical in large-scale, multilingual settings such as Hajj and Umrah, where millions of pilgrims, including DHH individuals, gather annually. Despite its significance, SLT remains challenging due to the multimodal nature of sign languages, which differ significantly from spoken languages. Although earlier surveys have reviewed foundational techniques such as gloss-based translation, attention mechanisms, and early machine learning models, they do not address the transformative impact of vision-language models and multimodal large language models. This survey fills that gap by systematically reviewing state-of-the-art AI-enabled approaches to SLT that facilitate seamless interaction between DHH and hearing individuals. We synthesize advancements across three key architectural paradigms: adapter-based language models, hierarchical tokenization frameworks, and vision-language pretraining. We also evaluate modern datasets and emerging evaluation metrics, and compare benchmark scores of the latest gloss-free frameworks. Furthermore, we examine current knowledge-based systems designed to assist DHH individuals, along with their limitations. By integrating these contributions, this survey advances SLT research and provides a roadmap for future innovations in low-resource adaptation, ethical AI development, and global accessibility initiatives.