Learning to Retrieve, Generate, and Compress: A Unified View of Efficient RAG
Abstract
Retrieval-Augmented Generation (RAG) has become a foundational technique in natural language processing and AI systems, enabling large language models (LLMs) to condition dynamically on external knowledge at inference time by retrieving relevant documents from large corpora. This hybrid approach improves the factuality, transparency, and adaptability of generation by combining the parametric knowledge of pre-trained transformers with non-parametric retrieval mechanisms. Despite growing adoption across domains ranging from open-domain question answering and knowledge-intensive tasks to scientific research, legal reasoning, and enterprise applications, RAG poses distinctive computational and modeling challenges: latency from multi-stage pipelines, retrieval-generation misalignment, context redundancy, faithfulness failures, and the scalability constraints of serving billion-scale document indexes.

This survey provides a comprehensive and deeply technical overview of the emerging landscape of efficient Retrieval-Augmented Generation for foundation models. We begin with a formal characterization of the RAG problem space, presenting unified mathematical formulations of its retrieval and generation components, including probabilistic models and variational interpretations; the core decomposition is sketched below. We categorize the primary RAG paradigms (retrieval-then-generation, retrieval-as-context, retrieval-as-planning, and iterative RAG) and analyze their respective strengths and computational bottlenecks. We then develop a detailed taxonomy of efficiency-oriented methods that improve retrieval quality, reduce inference latency, minimize memory consumption, and enable end-to-end trainability. Techniques surveyed include dense and sparse vector indexing, multi-vector compression, approximate nearest neighbor search, memory pruning, retrieval reranking, late-interaction models, early-exit decoding, passage filtering, and hybrid sparse-dense fusion mechanisms.

To provide empirical grounding, we compile and analyze benchmark results across standard datasets (e.g., Natural Questions, TriviaQA, ELI5, FEVER, HotpotQA), architectures (e.g., DPR, FiD, REALM, RAG-Sequence, Atlas, GTR), and hardware setups. A comparative evaluation highlights the trade-offs among retrieval cost, generation accuracy, document redundancy, and interpretability in open-domain, multi-hop, and domain-adaptive settings. We also discuss auxiliary components, such as rerankers, memory controllers, and faithful decoding strategies, that significantly affect end-to-end performance.

Beyond the current state of the art, we identify key open challenges and propose a set of future research directions: unified end-to-end optimization of retriever and generator, retrieval-aware generation objectives, dynamic and query-adaptive context selection, multimodal and multilingual retrieval integration, scalable lifelong-learning architectures, and robust hallucination mitigation. We emphasize the importance of developing more transparent, personalized, and controllable RAG systems that align with human expectations and safety norms. Ultimately, this survey aims to serve as both a foundational resource and a strategic roadmap for researchers and practitioners working on efficient and grounded language generation.
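To make the retrieval-generation decomposition concrete, the following is a minimal sketch of the standard probabilistic formulation, in the style of RAG-Sequence (Lewis et al., 2020); the notation ($E_q$, $E_d$, $p_\eta$, $p_\theta$) is illustrative rather than fixed by this survey.

```latex
% Retrieval-then-generation as marginalization over retrieved passages z;
% truncating the sum to the top-k passages keeps inference tractable.
\begin{align}
  p(y \mid x) &\approx \sum_{z \,\in\, \mathrm{top}\text{-}k\,(p_\eta(\cdot \mid x))}
      p_\eta(z \mid x)\, p_\theta(y \mid x, z) \\[4pt]
  p_\eta(z \mid x) &\propto \exp\!\big( E_q(x)^\top E_d(z) \big)
      && \text{(dual-encoder retriever)} \\[4pt]
  p_\theta(y \mid x, z) &= \prod_{t=1}^{|y|}
      p_\theta\big(y_t \mid x, z, y_{<t}\big)
      && \text{(autoregressive generator)}
\end{align}
```

Most of the efficiency techniques surveyed here target one of the two factors: indexing, compression, and approximate nearest neighbor search accelerate $p_\eta$, while reranking, passage filtering, and early-exit decoding reduce the cost of conditioning $p_\theta$ on retrieved context.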
As RAG continues to mature, it holds the promise of unlocking new capabilities for language models by enabling them to reason over explicit knowledge at scale—bridging the gap between memorization and inference, and laying the groundwork for next-generation interactive, agentic, and trustworthy AI systems.
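As a concrete, self-contained illustration of two efficiency techniques named in the abstract (dense vector indexing with product-quantization compression, and approximate nearest neighbor search), here is a minimal Python sketch using FAISS. The random embeddings stand in for a real dual-encoder such as DPR, and every parameter choice (d, nlist, m, nprobe) is an illustrative assumption rather than a setting from any surveyed system.

```python
# Minimal sketch: dense passage indexing + approximate nearest neighbor
# (ANN) search with a FAISS IVF-PQ index. Random embeddings stand in for
# a real dual-encoder; all sizes and parameters below are illustrative.
import numpy as np
import faiss

d, n_docs = 768, 100_000                 # embedding dim, toy corpus size
rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((n_docs, d)).astype("float32")
faiss.normalize_L2(doc_vecs)             # unit vectors: L2 ranking == cosine ranking

# IVF-PQ: a coarse quantizer partitions the space into nlist cells, and
# product quantization compresses each 768-dim float vector to m bytes.
nlist, m = 1024, 64
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(doc_vecs)                    # learn coarse centroids + PQ codebooks
index.add(doc_vecs)

index.nprobe = 16                        # cells visited per query (recall/latency knob)

query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)  # approximate top-5 passage ids
print(ids[0], distances[0])              # candidates to rerank / feed the generator
```

The memory saving is the point: at m = 64 bytes per vector, the compressed codes for this toy corpus occupy about 6 MB versus roughly 300 MB for raw float32 embeddings, and nprobe trades recall against latency without rebuilding the index.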