Multilingual RAG Agents for Localized Knowledge: Adaptive Indexing for Under-Represented Languages

Abstract

The democratization of information through Retrieval-Augmented Generation (RAG) is hindered by its inherent bias towards dominant languages, primarily English. This work introduces Multilingual Adaptive RAG (MARAG), a novel framework for building multilingual RAG agents capable of accurate knowledge localization in under-represented languages. MARAG addresses the dual challenge of sparse text embedding spaces and cross-lingual knowledge transfer by implementing an adaptive, multi-stage indexing pipeline. Based on the linguistic properties and resource availability of the target language, this pipeline selects among dense vector retrieval, hybrid sparse-dense methods, and a novel cross-lingual knowledge graph alignment technique. We empirically evaluate MARAG on a curated benchmark covering three typologically diverse under-represented languages (Swahili, Bengali, and Amharic) and one high-resource language (Spanish) for contrast. Our framework demonstrates a consistent and significant improvement over static, monolingual RAG baselines and direct machine translation approaches, with an average increase of 22.7% in Answer Relevance and 18.3% in Factual Accuracy for the under-represented languages. Furthermore, we show that MARAG’s adaptive indexing reduces latency by up to 40% for languages where hybrid methods are optimal. This research establishes a scalable, resource-aware paradigm for equitable information access, providing a concrete pathway to bridge the digital knowledge divide.
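
The adaptive selection step described in the abstract could be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not state the selection criteria, so the `LanguageProfile` fields, the threshold values, and the `select_strategy` function are all hypothetical assumptions, and the example profiles are made-up placeholders rather than measurements for any real language.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    """The three retrieval options named in the abstract."""
    DENSE = "dense vector retrieval"
    HYBRID = "hybrid sparse-dense retrieval"
    KG_ALIGNMENT = "cross-lingual knowledge graph alignment"


@dataclass(frozen=True)
class LanguageProfile:
    # All fields are hypothetical stand-ins for "linguistic properties
    # and resource availability"; the abstract does not define them.
    name: str
    corpus_tokens: int          # amount of indexable text available
    embedding_coverage: float   # 0..1, how well the embedding model covers the language
    morphologically_rich: bool  # example of a linguistic property


def select_strategy(
    profile: LanguageProfile,
    dense_coverage: float = 0.8,          # assumed threshold, not from the paper
    min_corpus_tokens: int = 1_000_000,   # assumed threshold, not from the paper
) -> Strategy:
    """Pick an indexing strategy for one language (illustrative rule set)."""
    # Well-covered languages with simple morphology: dense retrieval alone.
    if profile.embedding_coverage >= dense_coverage and not profile.morphologically_rich:
        return Strategy.DENSE
    # Very little native text: lean on knowledge aligned from other languages.
    if profile.corpus_tokens < min_corpus_tokens:
        return Strategy.KG_ALIGNMENT
    # Otherwise combine sparse lexical signals with dense embeddings.
    return Strategy.HYBRID


# Placeholder profiles with invented values, for illustration only.
profiles = [
    LanguageProfile("high_resource_example", 1_000_000_000, 0.95, False),
    LanguageProfile("mid_resource_example", 10_000_000, 0.60, True),
    LanguageProfile("low_resource_example", 100_000, 0.30, True),
]

chosen = {p.name: select_strategy(p) for p in profiles}
for name, strategy in chosen.items():
    print(f"{name}: {strategy.value}")
```

Keeping the decision in one small, pure function makes the thresholds easy to tune per deployment, and the rule set could be replaced by a learned selector without changing the callers.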
