TARGAMA: A Novel Benchmark Dataset and Framework for Translating Dialectal Arabic to English with Generative Language Models

Bouthaina Abdou
Hossam Elsafty
Farizeh Aldabbas
Maren Pielka
Rafet Sifa
Lucie Flek


Abstract

Arabic, one of the world’s most widely spoken languages, has many dialects that often differ significantly from Modern Standard Arabic (MSA) and from one another. This linguistic diversity presents considerable challenges for machine translation systems, especially when translating dialectal Arabic into MSA or English. To address these challenges, this work introduces TARGAMA, a novel benchmark dataset and framework designed to improve dialectal Arabic-English translation by leveraging state-of-the-art generative language models. As part of this work, we develop the largest known multi-dialectal Arabic-English parallel corpus, covering six major dialects. Using this dataset, we evaluate a variety of generative language models and propose a unified framework for dialect-aware translation. Our approach demonstrates strong performance across dialects and offers a scalable solution for improving translation quality in low-resource and linguistically diverse settings.

Version published to 10.21203/rs.3.rs-8007453/v1 on Research Square
Nov 20, 2025

