It's All Connected: A Survey for Multimodal Arabic AI


Abstract

Multimodal AI integrates text, vision, and speech within unified reasoning frameworks, yet Arabic remains significantly underrepresented due to diglossia, morphological complexity, and scarce multimodal resources. This survey delivers the first comprehensive technical roadmap for Arabic multimodal AI, covering the progression from unimodal Arabic NLP, OCR, and ASR to recent Arabic-capable Multimodal Large Language Models (MLLMs). We review available multimodal datasets, modality encoders, tokenization approaches, connector designs, and fusion strategies used in state-of-the-art systems. We also provide the first consolidated evaluation of Arabic-capable MLLMs on the multimodal benchmarks ARB and PEARL, analyzing performance, robustness, and domain generalization across OCR-grounded and open-domain VQA settings. Despite recent progress, challenges persist in cultural grounding, dialect inclusivity, dataset scale, and open-access ecosystem maturity. We outline actionable directions for scalable and culturally aligned Arabic multimodal intelligence, including parameter-efficient adaptation, broader corpus development, and unified evaluation protocols. By consolidating technical advances and empirical insights, this survey establishes a foundation to guide the next generation of Arabic-centric multimodal research.