Quantifying the relative impact of embedding models and system architecture on first-stage dense retrieval performance
Abstract
Retrieval-augmented generation (RAG) systems increasingly depend on dense retrieval as a critical subsystem that constrains end-to-end performance, reliability, and cost. In practice, retrieval quality is often attributed primarily to embedding model choice, while the impact of system-level design decisions remains less well quantified under operational constraints. This paper presents a controlled empirical study of first-stage dense retrieval, comparing the effect of embedding model selection with architectural choices, specifically index structure and retrieval-unit granularity. Holding the corpus, evaluation protocol, and similarity function constant, we evaluate a lightweight baseline encoder and a modern retrieval-optimized encoder across exact and approximate vector search configurations and multiple chunk sizes. Performance is measured using standard retrieval metrics alongside tail-latency statistics representative of production workloads. The results show that embedding model upgrades produce limited and inconsistent improvements in first-stage retrieval quality while substantially increasing computational cost, whereas system-level design choices induce large and predictable shifts in the quality-latency trade-off. These findings indicate that architectural decisions often dominate embedding choice in determining first-stage retrieval behavior, and they provide practical guidance for prioritizing engineering effort in retrieval-backed systems.
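The exact-versus-approximate index comparison at the heart of the study can be illustrated with a minimal sketch. This is not the paper's code: it uses synthetic random vectors, a brute-force scan as the exact index, and a simple random-hyperplane LSH bucketing as a stand-in for a production ANN index (e.g., HNSW), then measures recall@k of the approximate results against the exact top-k, the kind of quality metric the abstract pairs with latency.

```python
# Hypothetical sketch of exact vs. approximate first-stage retrieval.
# Synthetic data and a toy LSH index are assumptions; the paper's actual
# encoders, corpus, and index structures are not reproduced here.
import math
import random

random.seed(0)
DIM, N_DOCS, N_QUERIES, K = 32, 2000, 20, 10

def rand_unit_vec(dim):
    """Random unit vector, so dot product equals cosine similarity."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

docs = [rand_unit_vec(DIM) for _ in range(N_DOCS)]
queries = [rand_unit_vec(DIM) for _ in range(N_QUERIES)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_topk(q, k=K):
    """Exact index: full scan over all documents (highest quality, slowest)."""
    ranked = sorted(range(N_DOCS), key=lambda i: -dot(q, docs[i]))
    return set(ranked[:k])

# Approximate index: hash each vector by the signs of its projections onto
# random hyperplanes; only the query's bucket is scanned at search time.
N_PLANES = 8
planes = [rand_unit_vec(DIM) for _ in range(N_PLANES)]

def signature(v):
    return tuple(dot(v, p) > 0 for p in planes)

buckets = {}
for i, d in enumerate(docs):
    buckets.setdefault(signature(d), []).append(i)

def approx_topk(q, k=K):
    """Approximate index: rank only candidates sharing the query's bucket."""
    cand = buckets.get(signature(q), [])
    ranked = sorted(cand, key=lambda i: -dot(q, docs[i]))
    return set(ranked[:k])

# Recall@K of the approximate index against exact ground truth.
recalls = []
for q in queries:
    truth = exact_topk(q)
    recalls.append(len(truth & approx_topk(q)) / K)
mean_recall = sum(recalls) / len(recalls)
print(f"mean recall@{K}, LSH vs. exact scan: {mean_recall:.2f}")
```

The sketch makes the architectural lever concrete: tightening or loosening the approximate index (here, the number of hyperplanes) shifts recall and candidate-set size in a predictable way, independent of which encoder produced the vectors.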