An End-to-End Bengali Speech-to-Sign Language Generation Framework Using Fine-Tuned Whisper ASR and Grapheme-Level Visual Mapping

Abstract

This paper presents an end-to-end speech-to-sign language generation framework that combines a fine-tuned Whisper-small automatic speech recognition (ASR) model with a grapheme-level visual mapping unit for sign synthesis. The system addresses the communication gap faced by Bengali-speaking deaf and hard-of-hearing people by translating spoken Bengali into synchronized sign language videos in real time. The ASR module is optimized through strategic layer freezing, Bengali-specific text normalization, and fine-tuning on the Common Voice 13.0 (bn) dataset, reaching a word error rate (WER) of 35.41% and a character error rate (CER) of 11.45% after 5.5k fine-tuning steps. The transcribed text is split into its component graphemes by a customized regular expression that handles intricate Bengali compound characters and diacritical marks. Each grapheme is then mapped to its corresponding sign in a pre-curated image database of Bangla Sign Language. Using OpenCV, the images are annotated, arranged in the correct sign sequence, and rendered as an interpretable video at a fixed frame rate. The system was compared against several baseline Bengali ASR models and was found to achieve higher transcription accuracy while also providing explainable visual output that is missing from prior work. Because it is modular, the system can also be extended to other sign systems and languages. This work offers a new, practical, and culturally appropriate assistive technology that improves access for the Bengali-speaking deaf and hard-of-hearing community and paves the way for future bidirectional speech-sign translation systems.
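The grapheme-splitting and sign-lookup steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' regular expression or database layout, so everything here is an assumption: the pattern is built from the Unicode Bengali block, and the names `split_graphemes`, `map_to_signs`, and `sign_db` are hypothetical.

```python
import re

# Building blocks from the Unicode Bengali block (assumed, not the paper's pattern).
CONSONANT = "[\u0995-\u09B9\u09DC\u09DD\u09DF]\u09BC?"  # consonant + optional nukta
VOWEL_SIGN = "[\u09BE-\u09C4\u09C7\u09C8\u09CB\u09CC\u09D7]"  # dependent vowel signs
MODIFIER = "[\u0981-\u0983]"  # chandrabindu, anusvara, visarga
HASANTA = "\u09CD"  # joins consonants into compound characters

GRAPHEME_RE = re.compile(
    # consonant cluster joined by hasanta, then optional vowel sign and modifiers
    f"{CONSONANT}(?:{HASANTA}{CONSONANT})*{HASANTA}?{VOWEL_SIGN}?{MODIFIER}*"
    # independent vowel with optional modifiers
    f"|[\u0985-\u0994]{MODIFIER}*"
    # khanda ta and Bengali digits stand alone
    f"|\u09CE|[\u09E6-\u09EF]"
)


def split_graphemes(text):
    """Return the Bengali grapheme clusters in text; other characters are skipped."""
    return GRAPHEME_RE.findall(text)


def map_to_signs(graphemes, sign_db):
    """Map clusters to sign images via sign_db (hypothetical dict: grapheme -> path).

    A cluster missing from sign_db falls back to its individual code points,
    and code points without an entry are dropped.
    """
    frames = []
    for cluster in graphemes:
        if cluster in sign_db:
            frames.append(sign_db[cluster])
        else:
            frames.extend(sign_db[ch] for ch in cluster if ch in sign_db)
    return frames
```

Under these assumptions, a conjunct such as ক্ষ stays together as one cluster, and the per-code-point fallback covers clusters that have no dedicated image. The resulting list of image paths could then be written out frame by frame, for example with OpenCV's `cv2.VideoWriter`, to obtain a fixed-frame-rate video.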
