Multilingual RAG Agents for Localized Knowledge: Adaptive Indexing for Under-Represented Languages
Abstract
The democratization of information through Retrieval-Augmented Generation (RAG) is hindered by its inherent bias towards dominant languages, primarily English. This work introduces Multilingual Adaptive RAG (MARAG), a novel framework for building multilingual RAG agents capable of accurate knowledge localization in under-represented languages. MARAG addresses the dual challenge of sparse text embedding spaces and cross-lingual knowledge transfer with an adaptive, multi-stage indexing pipeline. Based on the linguistic properties and resource availability of the target language, the pipeline selects among dense vector retrieval, hybrid sparse-dense methods, and a novel cross-lingual knowledge graph alignment technique. We empirically evaluate MARAG on a curated benchmark covering three typologically diverse under-represented languages (Swahili, Bengali, and Amharic) and one high-resource language (Spanish) for contrast. Our framework consistently and significantly outperforms static, monolingual RAG baselines and direct machine translation approaches, with average increases of 22.7% in Answer Relevance and 18.3% in Factual Accuracy for the under-represented languages. Furthermore, we show that MARAG’s adaptive indexing reduces latency by up to 40% for languages where hybrid methods are optimal. This research establishes a scalable, resource-aware paradigm for equitable information access, providing a concrete pathway to bridge the digital knowledge divide.
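The abstract does not specify how the pipeline chooses among its three indexing strategies, so the following is only a minimal sketch of what a rule-based selector could look like. Everything in it is an assumption made for illustration: the `LanguageProfile` fields, the thresholds, the decision order, and the placeholder language codes are hypothetical and are not taken from MARAG.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    """The three indexing options named in the abstract."""
    DENSE = "dense_vector_retrieval"
    HYBRID = "hybrid_sparse_dense"
    KG_ALIGNMENT = "cross_lingual_kg_alignment"


@dataclass(frozen=True)
class LanguageProfile:
    """Hypothetical per-language features; not defined in the source."""
    code: str
    corpus_tokens: int          # assumed proxy for resource availability
    embedding_coverage: float   # assumed share of vocabulary the embedder covers well (0..1)
    morphologically_rich: bool  # assumed linguistic property


def select_strategy(
    profile: LanguageProfile,
    min_tokens: int = 1_000_000,   # illustrative threshold, not from the paper
    min_coverage: float = 0.8,     # illustrative threshold, not from the paper
) -> Strategy:
    """Pick an indexing strategy from simple, assumed rules."""
    # Well-covered languages with simple morphology: dense retrieval alone.
    if profile.embedding_coverage >= min_coverage and not profile.morphologically_rich:
        return Strategy.DENSE
    # Enough native text to build a lexical index: combine sparse and dense.
    if profile.corpus_tokens >= min_tokens:
        return Strategy.HYBRID
    # Too little native text: fall back to cross-lingual graph alignment.
    return Strategy.KG_ALIGNMENT


# Placeholder profiles with made-up values, not measurements of any real language.
profiles = [
    LanguageProfile("xx", corpus_tokens=50_000_000, embedding_coverage=0.95, morphologically_rich=False),
    LanguageProfile("yy", corpus_tokens=5_000_000, embedding_coverage=0.60, morphologically_rich=True),
    LanguageProfile("zz", corpus_tokens=200_000, embedding_coverage=0.40, morphologically_rich=True),
]

for p in profiles:
    print(p.code, select_strategy(p).value)
```

A learned or score-based selector would fit the abstract's description equally well; the fixed rules above are used only because they make the resource-aware branching easy to read.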