Quantifying the relative impact of embedding models and system architecture on first-stage dense retrieval performance

Abstract

Retrieval-augmented generation (RAG) systems increasingly depend on dense retrieval as a critical subsystem that constrains end-to-end performance, reliability, and cost. In practice, retrieval quality is often attributed primarily to embedding model choice, while the impact of system-level design decisions remains less well quantified under operational constraints. This paper presents a controlled empirical study of first-stage dense retrieval, comparing the effect of embedding model selection with architectural choices, specifically index structure and retrieval-unit granularity. Holding the corpus, evaluation protocol, and similarity function constant, we evaluate a lightweight baseline encoder and a modern retrieval-optimized encoder across exact and approximate vector search configurations and multiple chunk sizes. Performance is measured using standard retrieval metrics alongside tail-latency statistics representative of production workloads. The results show that embedding model upgrades produce limited and inconsistent improvements in first-stage retrieval quality while substantially increasing computational cost, whereas system-level design choices induce large and predictable shifts in the quality-latency trade-off. These findings indicate that architectural decisions often dominate embedding choice in determining first-stage retrieval behavior, and they provide practical guidance for prioritizing engineering effort in retrieval-backed systems.
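The exact-versus-approximate comparison described above can be illustrated with a minimal sketch. The paper does not specify its index implementation, so the snippet below assumes a cosine-similarity setup and contrasts exhaustive (flat) search with a toy IVF-style approximate index, where the number of probed clusters (`n_probe`) is the kind of system-level knob that trades quality against latency. All names and parameters here are illustrative, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64)).astype(np.float32)   # stand-in chunk embeddings
queries = rng.normal(size=(100, 64)).astype(np.float32)

# Normalize so a dot product equals cosine similarity (the fixed similarity function).
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

def exact_topk(q, docs, k=10):
    """Exhaustive (flat) search: score every document, always exact."""
    scores = docs @ q
    return np.argsort(-scores)[:k]

def approx_topk(q, docs, centroids, assign, k=10, n_probe=4):
    """Toy IVF-style approximate search: score only documents in the
    n_probe clusters whose centroids are closest to the query."""
    probed = np.argsort(-(centroids @ q))[:n_probe]
    cand = np.flatnonzero(np.isin(assign, probed))
    scores = docs[cand] @ q
    return cand[np.argsort(-scores)[:k]]

# Build a crude inverted-file index: random documents as centroids,
# each document assigned to its nearest centroid.
n_clusters = 50
centroids = corpus[rng.choice(len(corpus), n_clusters, replace=False)]
assign = np.argmax(corpus @ centroids.T, axis=1)

# Recall@10 of the approximate index against exact search.
hits, total = 0, 0
for q in queries:
    exact = set(exact_topk(q, corpus))
    approx = set(approx_topk(q, corpus, centroids, assign))
    hits += len(exact & approx)
    total += 10
print(f"recall@10 = {hits / total:.2f}")
```

Raising `n_probe` (or the cluster count) moves the operating point along the quality-latency curve without touching the embedding model, which is the architectural effect the abstract argues tends to dominate first-stage behavior.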
