Learning to Retrieve, Generate, and Compress: A Unified View of Efficient RAG
Abstract
Retrieval-Augmented Generation (RAG) has become a foundational technique in natural language processing and AI systems, enabling large language models (LLMs) to condition dynamically on external knowledge at inference time by retrieving relevant documents from large corpora. This hybrid approach improves the factuality, transparency, and adaptability of generation by combining the parametric knowledge of pre-trained transformers with non-parametric retrieval mechanisms. Despite growing adoption across domains ranging from open-domain question answering and knowledge-intensive tasks to scientific research, legal reasoning, and enterprise applications, RAG poses distinctive computational and modeling challenges: latency from multi-stage pipelines, retrieval-generation misalignment, context redundancy, faithfulness failures, and the scalability constraints of serving billion-scale document indexes.

This survey provides a comprehensive and deeply technical overview of the emerging landscape of efficient Retrieval-Augmented Generation for foundation models. We begin with a formal characterization of the RAG problem space, presenting unified mathematical formulations of its retrieval and generation components, including probabilistic models and variational interpretations; the core decomposition is sketched below. We categorize the primary RAG paradigms (retrieval-then-generation, retrieval-as-context, retrieval-as-planning, and iterative RAG) and analyze their respective strengths and computational bottlenecks. We then develop a detailed taxonomy of efficiency-oriented methods that improve retrieval quality, reduce inference latency, minimize memory consumption, and enable end-to-end trainability. Techniques surveyed include dense and sparse vector indexing, multi-vector compression, approximate nearest neighbor search, memory pruning, retrieval reranking, late-interaction models, early-exit decoding, passage filtering, and hybrid sparse-dense fusion mechanisms.

To provide empirical grounding, we compile and analyze benchmark results across standard datasets (e.g., Natural Questions, TriviaQA, ELI5, FEVER, HotpotQA), architectures (e.g., DPR, FiD, REALM, RAG-Sequence, Atlas, GTR), and hardware setups. A comparative evaluation highlights the trade-offs among retrieval cost, generation accuracy, document redundancy, and interpretability in open-domain, multi-hop, and domain-adaptive settings. We also discuss auxiliary components, such as rerankers, memory controllers, and faithful decoding strategies, that significantly affect end-to-end performance.

Beyond the current state of the art, we identify key open challenges and propose a set of future research directions: unified end-to-end optimization of retriever and generator, retrieval-aware generation objectives, dynamic and query-adaptive context selection, multimodal and multilingual retrieval integration, scalable lifelong-learning architectures, and robust hallucination mitigation. We emphasize the importance of developing more transparent, personalized, and controllable RAG systems that align with human expectations and safety norms. Ultimately, this survey aims to serve as both a foundational resource and a strategic roadmap for researchers and practitioners working on efficient and grounded language generation.
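To make the retrieval-generation decomposition concrete, the following is a minimal sketch of the standard probabilistic formulation, in the style of RAG-Sequence (Lewis et al., 2020); the notation ($E_q$, $E_d$, $p_\eta$, $p_\theta$) is illustrative rather than fixed by this survey.

```latex
% Retrieval-then-generation as marginalization over retrieved passages z;
% truncating the sum to the top-k passages keeps inference tractable.
\begin{align}
  p(y \mid x) &\approx \sum_{z \,\in\, \mathrm{top}\text{-}k\,(p_\eta(\cdot \mid x))}
      p_\eta(z \mid x)\, p_\theta(y \mid x, z) \\[4pt]
  p_\eta(z \mid x) &\propto \exp\!\big( E_q(x)^\top E_d(z) \big)
      && \text{(dual-encoder retriever)} \\[4pt]
  p_\theta(y \mid x, z) &= \prod_{t=1}^{|y|}
      p_\theta\big(y_t \mid x, z, y_{<t}\big)
      && \text{(autoregressive generator)}
\end{align}
```

Most of the efficiency techniques surveyed here target one of the two factors: indexing, compression, and approximate nearest neighbor search accelerate $p_\eta$, while reranking, passage filtering, and early-exit decoding reduce the cost of conditioning $p_\theta$ on retrieved context.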
As RAG continues to mature, it holds the promise of unlocking new capabilities for language models by enabling them to reason over explicit knowledge at scale—bridging the gap between memorization and inference, and laying the groundwork for next-generation interactive, agentic, and trustworthy AI systems.
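As a concrete, self-contained illustration of two efficiency techniques named in the abstract (dense vector indexing with product-quantization compression, and approximate nearest neighbor search), here is a minimal Python sketch using FAISS. The random embeddings stand in for a real dual-encoder such as DPR, and every parameter choice (d, nlist, m, nprobe) is an illustrative assumption rather than a setting from any surveyed system.

```python
# Minimal sketch: dense passage indexing + approximate nearest neighbor
# (ANN) search with a FAISS IVF-PQ index. Random embeddings stand in for
# a real dual-encoder; all sizes and parameters below are illustrative.
import numpy as np
import faiss

d, n_docs = 768, 100_000                 # embedding dim, toy corpus size
rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((n_docs, d)).astype("float32")
faiss.normalize_L2(doc_vecs)             # unit vectors: L2 ranking == cosine ranking

# IVF-PQ: a coarse quantizer partitions the space into nlist cells, and
# product quantization compresses each 768-dim float vector to m bytes.
nlist, m = 1024, 64
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(doc_vecs)                    # learn coarse centroids + PQ codebooks
index.add(doc_vecs)

index.nprobe = 16                        # cells visited per query (recall/latency knob)

query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)  # approximate top-5 passage ids
print(ids[0], distances[0])              # candidates to rerank / feed the generator
```

The memory saving is the point: at m = 64 bytes per vector, the compressed codes for this toy corpus occupy about 6 MB versus roughly 300 MB for raw float32 embeddings, and nprobe trades recall against latency without rebuilding the index.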