It's All Connected: A Survey for Multimodal Arabic AI
Abstract
Multimodal AI integrates text, vision, and speech within unified reasoning frameworks, yet Arabic remains significantly underrepresented due to diglossia, morphological complexity, and scarce multimodal resources. This survey delivers the first comprehensive technical roadmap for Arabic multimodal AI, covering the progression from unimodal Arabic NLP, OCR, and ASR to recent Arabic-capable Multimodal Large Language Models (MLLMs). We review available multimodal datasets, modality encoders, tokenization approaches, connector designs, and fusion strategies used in state-of-the-art systems. We also provide the first consolidated evaluation of Arabic-capable MLLMs on the ARB and PEARL multimodal benchmarks, analyzing performance, robustness, and domain generalization across OCR-grounded and open-domain VQA settings. Despite recent progress, challenges persist in cultural grounding, dialect inclusivity, dataset scale, and the maturity of the open-access ecosystem. We outline actionable directions for scalable and culturally aligned Arabic multimodal intelligence, including parameter-efficient adaptation, broader corpus development, and unified evaluation protocols. By consolidating technical advances and empirical insights, this survey establishes a foundation to guide the next generation of Arabic-centric multimodal research.