Bidirectional Translation of ASL and English Using Machine Vision and CNN and Transformer Networks
Abstract
This study aims to develop a system for translating American Sign Language (ASL) to and from English, enhancing accessibility for ASL users. We leveraged a publicly available dataset to train a model that accurately predicts ASL signs and their English translations. The system employs transformer networks for bidirectional translation: converting text and speech into ASL imagery, and using computer vision to translate ASL signs into English text. For user accessibility, we built a web-based interface that integrates the MediaPipe computer vision framework to detect key body landmarks, including the hands, shoulders, and facial features. This enables the system to process text and speech input as well as video recordings, which are stored using msgpack and analyzed to generate ASL imagery. Additionally, we are developing a transformer model trained jointly on pairs of gloss sequences and English sentences using connectionist temporal classification (CTC) and cross-entropy losses. We also use an EfficientNet-B0 backbone pretrained on ImageNet, followed by 1D convolution blocks, to extract features from video frames, facilitating the conversion of ASL signs into structured English text.
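The landmark-capture step can be illustrated with a short sketch. The following is a minimal example rather than the authors' implementation: it assumes MediaPipe's Holistic solution for hand, pose, and face landmarks, OpenCV for video decoding, and msgpack for serialization; the file and function names are illustrative.

```python
# Minimal sketch: per-frame landmark extraction with MediaPipe Holistic,
# serialized with msgpack. File and function names are illustrative.
import cv2
import mediapipe as mp
import msgpack

mp_holistic = mp.solutions.holistic


def landmarks_to_list(landmark_list):
    """Flatten a MediaPipe landmark list into [x, y, z] triples (None if not detected)."""
    if landmark_list is None:
        return None
    return [[lm.x, lm.y, lm.z] for lm in landmark_list.landmark]


def extract_and_pack(video_path, out_path="landmarks.msgpack"):
    """Detect pose, hand, and face landmarks per frame and store them with msgpack."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append({
                "pose": landmarks_to_list(results.pose_landmarks),
                "left_hand": landmarks_to_list(results.left_hand_landmarks),
                "right_hand": landmarks_to_list(results.right_hand_landmarks),
                "face": landmarks_to_list(results.face_landmarks),
            })
    cap.release()
    with open(out_path, "wb") as f:
        f.write(msgpack.packb(frames))


if __name__ == "__main__":
    extract_and_pack("sign_clip.mp4")
```

The recognition model outlined above (EfficientNet-B0 frame features, 1D convolution blocks, a transformer, and joint CTC/cross-entropy training) could look roughly like the following PyTorch sketch. The dimensions, vocabulary sizes, and decoder wiring are assumptions for illustration, not the configuration used in this work.

```python
# Minimal PyTorch sketch of the sign-to-text model; hyperparameters and wiring
# are illustrative assumptions, not the configuration reported in this work.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights


class SignTranslationModel(nn.Module):
    """EfficientNet-B0 frame encoder + 1D conv blocks + transformer, with a CTC
    head over gloss labels and a cross-entropy head over English tokens."""

    def __init__(self, d_model=256, gloss_vocab=1000, text_vocab=8000):
        super().__init__()
        backbone = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)
        # Per-frame features: EfficientNet-B0 conv trunk + global average pooling -> 1280-d.
        self.frame_encoder = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1))
        # 1D convolutions over the time axis to model short-range temporal context.
        self.temporal = nn.Sequential(
            nn.Conv1d(1280, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.gloss_head = nn.Linear(d_model, gloss_vocab)  # CTC over the gloss sequence
        self.text_head = nn.Linear(d_model, text_vocab)    # cross-entropy over English tokens

    def forward(self, frames, text_tokens):
        # frames: (batch, time, 3, H, W); text_tokens: (batch, text_len)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)).flatten(1)          # (b*t, 1280)
        feats = self.temporal(feats.view(b, t, -1).transpose(1, 2)).transpose(1, 2)
        memory = self.transformer.encoder(feats)                             # (b, t, d_model)
        tgt = self.text_embed(text_tokens)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        decoded = self.transformer.decoder(tgt, memory, tgt_mask=tgt_mask)
        return self.gloss_head(memory), self.text_head(decoded)


def joint_loss(gloss_logits, text_logits, gloss_targets, input_lengths, target_lengths, text_targets):
    """CTC loss on the encoder's gloss logits plus cross-entropy on the decoder output
    (token shifting for teacher forcing is omitted for brevity)."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
        gloss_logits.log_softmax(-1).transpose(0, 1),  # (time, batch, gloss_vocab)
        gloss_targets, input_lengths, target_lengths)
    ce = nn.CrossEntropyLoss()(text_logits.flatten(0, 1), text_targets.flatten())
    return ctc + ce
```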