Bidirectional Translation of ASL and English Using Machine Vision and CNN and Transformer Networks
Abstract
This study aims to develop a system for translating American Sign Language (ASL) to and from English, enhancing accessibility for ASL users. We leveraged a publicly available dataset to train a model that accurately predicts ASL signs and their English translations. The system employs transformer networks for bidirectional translation: converting text and speech into ASL imagery, and using computer vision to translate ASL signs into English text. For user accessibility, we built a web-based interface that integrates the MediaPipe computer vision framework to detect key body landmarks, including the hands, shoulders, and facial features. This enables the system to process text and speech input as well as video recordings, which are stored using msgpack and analyzed to generate ASL imagery. Additionally, we are developing a transformer model trained jointly on pairs of gloss sequences and English sentences using connectionist temporal classification (CTC) and cross-entropy losses. We also use an EfficientNet-B0 backbone pretrained on ImageNet, followed by 1D convolution blocks, to extract features from video frames, facilitating the conversion of ASL signs into structured English text.
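The landmark-capture step can be illustrated with a short sketch. The following is a minimal example rather than the authors' implementation: it assumes MediaPipe's Holistic solution for hand, pose, and face landmarks, OpenCV for video decoding, and msgpack for serialization; the file and function names are illustrative.

```python
# Minimal sketch: per-frame landmark extraction with MediaPipe Holistic,
# serialized with msgpack. File and function names are illustrative.
import cv2
import mediapipe as mp
import msgpack

mp_holistic = mp.solutions.holistic


def landmarks_to_list(landmark_list):
    """Flatten a MediaPipe landmark list into [x, y, z] triples (None if not detected)."""
    if landmark_list is None:
        return None
    return [[lm.x, lm.y, lm.z] for lm in landmark_list.landmark]


def extract_and_pack(video_path, out_path="landmarks.msgpack"):
    """Detect pose, hand, and face landmarks per frame and store them with msgpack."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append({
                "pose": landmarks_to_list(results.pose_landmarks),
                "left_hand": landmarks_to_list(results.left_hand_landmarks),
                "right_hand": landmarks_to_list(results.right_hand_landmarks),
                "face": landmarks_to_list(results.face_landmarks),
            })
    cap.release()
    with open(out_path, "wb") as f:
        f.write(msgpack.packb(frames))


if __name__ == "__main__":
    extract_and_pack("sign_clip.mp4")
```

The recognition model outlined above (EfficientNet-B0 frame features, 1D convolution blocks, a transformer, and joint CTC/cross-entropy training) could look roughly like the following PyTorch sketch. The dimensions, vocabulary sizes, and decoder wiring are assumptions for illustration, not the configuration used in this work.

```python
# Minimal PyTorch sketch of the sign-to-text model; hyperparameters and wiring
# are illustrative assumptions, not the configuration reported in this work.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights


class SignTranslationModel(nn.Module):
    """EfficientNet-B0 frame encoder + 1D conv blocks + transformer, with a CTC
    head over gloss labels and a cross-entropy head over English tokens."""

    def __init__(self, d_model=256, gloss_vocab=1000, text_vocab=8000):
        super().__init__()
        backbone = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)
        # Per-frame features: EfficientNet-B0 conv trunk + global average pooling -> 1280-d.
        self.frame_encoder = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1))
        # 1D convolutions over the time axis to model short-range temporal context.
        self.temporal = nn.Sequential(
            nn.Conv1d(1280, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.gloss_head = nn.Linear(d_model, gloss_vocab)  # CTC over the gloss sequence
        self.text_head = nn.Linear(d_model, text_vocab)    # cross-entropy over English tokens

    def forward(self, frames, text_tokens):
        # frames: (batch, time, 3, H, W); text_tokens: (batch, text_len)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)).flatten(1)          # (b*t, 1280)
        feats = self.temporal(feats.view(b, t, -1).transpose(1, 2)).transpose(1, 2)
        memory = self.transformer.encoder(feats)                             # (b, t, d_model)
        tgt = self.text_embed(text_tokens)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        decoded = self.transformer.decoder(tgt, memory, tgt_mask=tgt_mask)
        return self.gloss_head(memory), self.text_head(decoded)


def joint_loss(gloss_logits, text_logits, gloss_targets, input_lengths, target_lengths, text_targets):
    """CTC loss on the encoder's gloss logits plus cross-entropy on the decoder output
    (token shifting for teacher forcing is omitted for brevity)."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
        gloss_logits.log_softmax(-1).transpose(0, 1),  # (time, batch, gloss_vocab)
        gloss_targets, input_lengths, target_lengths)
    ce = nn.CrossEntropyLoss()(text_logits.flatten(0, 1), text_targets.flatten())
    return ctc + ce
```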