TARGAMA: A Novel Benchmark Dataset and Framework for Translating Dialectal Arabic to English with Generative Language Models

Abstract

Arabic, one of the world’s most widely spoken languages, is marked by extensive dialectal variation that often differs significantly from Modern Standard Arabic (MSA) and from other dialects. This linguistic diversity presents considerable challenges for machine translation systems, especially when translating dialectal Arabic into MSA or English. To address these challenges, this work introduces TARGAMA, a novel benchmark dataset and framework designed to improve dialectal Arabic-English translation by leveraging state-of-the-art generative language models. As part of this work, we develop the largest known multi-dialectal Arabic-English parallel corpus, covering six major dialects. Using this dataset, we evaluate a variety of generative language models and propose a unified framework for dialect-aware translation. Our approach demonstrates strong performance across dialects and offers a scalable solution for improving translation quality in low-resource and linguistically diverse settings.