ARAFA: An LLM Generated Arabic Fact-Checking Dataset

Christophe Khalil
Shady Elbassuoni
Rida Assaf

Abstract

Automatic fact-checking poses a significant challenge in Arabic natural language processing due to the scarcity of datasets and resources. In this manuscript, we introduce ARAFA, a new large-scale dataset for fact-checking in Modern Standard Arabic, built with an automated framework that leverages large language models (LLMs). The dataset was constructed through a three-step pipeline: (1) claim generation from Arabic Wikipedia pages with supporting textual evidence, (2) claim mutation to generate challenging counterfactual claims with refuting evidence, and (3) automatic validation that checks whether each generated claim is supported or refuted by its accompanying evidence, or whether the evidence does not provide enough information to judge the claim. The resulting dataset comprises 181,976 claim-evidence pairs labeled as supported, refuted, or not enough information. Human evaluation carried out on a test sample from the dataset demonstrated strong inter-annotator agreement, measured with Cohen’s Kappa: κ = 0.89 for supported claims and κ = 0.94 for refuted claims. Measured against this human-evaluated sample, the automatic validation step achieved 86% accuracy for supported claims and 88% for refuted ones. To showcase ARAFA’s value as a resource for automatic Arabic fact-checking, four open-source transformer-based models were fine-tuned on ARAFA, with the top-performing model achieving a Macro F1-score of 77% on the test data. In addition to ARAFA being the first large-scale dataset for Arabic fact-checking, our framework presents a scalable approach for developing similar resources for other low-resource languages.
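
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how Cohen's Kappa and a Macro F1-score are commonly computed. It is not the authors' evaluation code. The function names, the label strings, and the toy annotations are all hypothetical, and the abstract does not specify the exact annotation protocol behind the reported per-class Kappa values.

```python
from collections import Counter

# Label names follow the three classes named in the abstract;
# the exact strings used in the dataset files are an assumption.
LABELS = ("supported", "refuted", "not enough information")


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two annotators who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        count_a[label] * count_b[label] for label in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def macro_f1(gold, predicted, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, not taken from ARAFA.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "yes", "no", "yes"]
toy_kappa = cohen_kappa(annotator_1, annotator_2)

gold = ["supported", "supported", "refuted", "refuted",
        "not enough information", "not enough information"]
pred = ["supported", "refuted", "refuted", "refuted",
        "not enough information", "supported"]
toy_macro_f1 = macro_f1(gold, pred)
```

Because Macro F1 averages the per-class scores without weighting by class size, a weak result on a small class lowers the score as much as a weak result on a large one, which makes it a reasonable summary metric for a three-way labeling task like this.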

Version published to 10.21203/rs.3.rs-7335564/v1 on Research Square
Aug 12, 2025

Jabuticaba: The largest commercial corpus for LLMs in Portuguese

This article has 4 authors:
1. Marcellus Amadeus
2. William Alberto Cruz Castaneda
3. José Roberto Homeli da Silva
4. Rodrigo Scotti
Latest version: Aug 5, 2025

Tokens with Meaning: A Hybrid Tokenization Approach for NLP

This article has 7 authors:
1. M. Ali Bayram
2. Ali Arda Fincan
3. Ahmet Semih Gümüş
4. Sercan Karakaş
5. Banu Diri
6. Savaş Yıldırım
7. Demircan Çelik
Latest version: Aug 6, 2025

Detecting Machine-Generated Arabic Text Using AraBERT and LSTM: Toward Trustworthy NLP in Low-Resource Languages

This article has 3 authors:
1. Tarek Barhoum
2. Mina Ibrahim
3. Mohamad Al Bali
Latest version: Aug 8, 2025
