TARGAMA: A Novel Benchmark Dataset and Framework for Translating Dialectal Arabic to English with Generative Language Models
Abstract
Arabic, one of the world’s most widely spoken languages, is marked by extensive dialectal variation: individual dialects often differ significantly both from Modern Standard Arabic (MSA) and from one another. This linguistic diversity presents considerable challenges for machine translation systems, especially when translating dialectal Arabic into MSA or English. To address these challenges, this work introduces TARGAMA, a novel benchmark dataset and framework designed to improve dialectal Arabic-English translation by leveraging state-of-the-art generative language models. As part of this work, we develop what is, to our knowledge, the largest multi-dialectal Arabic-English parallel corpus, covering six major dialects. Using this dataset, we evaluate a variety of generative language models and propose a unified framework for dialect-aware translation. Our approach demonstrates strong performance across dialects and offers a scalable solution for improving translation quality in low-resource and linguistically diverse settings.