An Exploratory Study of Code Retrieval Techniques in Coding Agents


Abstract

Code retrieval is central to coding agents: it is the process of bringing relevant code snippets, documentation, or other knowledge from a repository into the agent's context so that the agent can take informed actions. Efficient code retrieval can therefore have a major positive impact on the performance of coding agents and the quality of their output. This study examines different code retrieval techniques, their integration into agentic workflows, and how they enhance the quality of coding agent output. We compare how human programmers and agents interact with tools, analyze lexical versus semantic search for code retrieval, evaluate the impact of retrieval, and review benchmarks with a focus on metrics such as latency, token usage, context utilization, and iteration loops. We report takeaways on the effectiveness of different retrieval tools, potential solutions, and opportunities for further research.
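
To ground the lexical-versus-semantic comparison mentioned in the abstract, the minimal sketch below (not code from the study) contrasts grep-style keyword ranking with similarity-based ranking; the toy bag-of-words `embed` function stands in for a real embedding model, and all names are placeholders.

```python
import math
import re
from collections import Counter

def lexical_search(query: str, files: dict[str, str], top_k: int = 3) -> list[str]:
    """Grep-style retrieval: rank files by raw keyword hits from the query."""
    keywords = set(re.findall(r"\w+", query.lower()))
    scores = {path: sum(text.lower().count(kw) for kw in keywords)
              for path, text in files.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real agent would call an embedding model here."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_search(query: str, files: dict[str, str], top_k: int = 3) -> list[str]:
    """RAG-style retrieval: rank files by vector similarity to the query."""
    q = embed(query)
    scores = {path: cosine(q, embed(text)) for path, text in files.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In practice the two approaches diverge most when the query uses different vocabulary than the code (where lexical search tends to miss) or when exact identifiers matter (where it tends to excel).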

Article activity feed

  1. This Zenodo record is a permanently preserved version of a Structured PREreview. You can view the complete PREreview at https://prereview.org/reviews/17536688.

    Does the introduction explain the objective of the research presented in the preprint? Yes The introduction clearly sets out the article's objective: to analyze how different coding agents perform code retrieval, compare lexical and semantic approaches, examine their integration into agentic loops, and characterize trade-offs (token efficiency, context usage, latency/iterations). It announces an exploratory approach focused on "real" agents and sets the stage for a more rigorous future benchmark; further on, the three research questions (semantic vs. lexical, value of "human" tools such as LSP for agents, value of specialized retrieval sub-agents) formalize this objective without ambiguity. To strengthen the introduction, it would help to restate these three RQs early (not only in Methods), flag the "one task/one repository" exploratory scope up front, and add immediate sources for several general claims (e.g., the context bottleneck, the role of grep/ripgrep).
    Are the methods well-suited for this research? Somewhat appropriate For a qualitative exploratory study, the method is generally consistent and well executed: same task and repository state for all runs, repeated trials to capture LLM non-determinism, selection of agents covering diverse paradigms (lexical, hybrid, AST/graph, multi-agent), and structured collection of execution traces, qualitative observations, and quantitative measures (tokens, tool calls, costs, completion status), with appendices that support transparency and reproducibility. However, several gaps limit external validity: (1) single task/repository (low diversity); (2) confounding variables are not isolated (different models, context windows, prompts, tool inventories, and architectures), preventing causal attribution to retrieval alone; (3) no controlled ablations (e.g., same model with lexical vs. semantic), no standardized retrieval metrics (precision/recall/coverage of retrieved snippets), and no statistical analysis; (4) measurement artifacts (e.g., cached tokens obscuring actual context pressure) and no harmonized latency/wall-time reporting; (5) qualitative coding lacks inter-annotator agreement. Overall, the method is a strong basis for hypothesis generation and behavior comparison, but not for causal or general claims. To reach "highly appropriate," add a multi-task/multi-repo benchmark, controlled ablations fixing model/window/tools, standardized retrieval metrics, normalized cost/latency (including cache handling), inferential statistics, and ideally a pre-registered plan with double-coding for qualitative judgments.
    Are the conclusions supported by the data? Somewhat supported Mostly yes for an exploratory study. Conclusions are aligned with the evidence: (1) RQ1 — no clear win for semantic search on this task is consistent with universal task success and differences largely in token usage; (2) LSP — added overhead without measured gains matches the traces, while LSP-inspired structural approaches (e.g., Aider) fare better when adapted for agents; (3) multi-agents — Amp's low token use alongside coordination overhead, with no systematic edge over strong single-agent setups, fits the data; (4) the transparency-vs-efficiency trade-off appears across agents. That said, the suggestion that AST/graph methods "outperform" others remains tentative: results come from one task/one repo, with confounders and no standardized retrieval metrics. The manuscript appropriately frames these as hypotheses.
    Are the data presentations, including visualizations, well-suited to represent the data? Neither appropriate and clear nor inappropriate and unclear The tables convey main results and trends, but visuals could be stronger and more accessible. Key metrics aren't fully normalized (e.g., displayed vs. actual tokens with caching) and need clearer notes/a common scale. Simple conceptual diagrams would help (Lexical vs. RAG vs. CKG vs. LSP, agent loop, sub-agents), as would standard charts (precision/recall, cost/latency, stacked tool calls, timeline views). Clearer labels, legends, units, and scope would further improve interpretability.
    How clearly do the authors discuss, explain, and interpret their findings and potential next steps for the research? Somewhat clearly Generally clear and tied to the core RQs, with sensible takeaways (transparency-vs-efficiency trade-offs, limits of direct LSP use for agents, and why AST/graph can be frugal). Limitations are candid, and "Future Work" outlines an actionable benchmark agenda. The narrative leans a bit too much on token consumption as a proxy; several claims would benefit from tighter sourcing and operational definitions (e.g., criteria for "sufficient retrieval," normalized metrics beyond tokens), and from a corrected MarsCode note: 88.3% file-localization on SWE-bench Lite's 12 Python repositories, while the tooling supports ~12 languages (a capability claim, not an evaluated accuracy). Next steps could be more concrete: specify the benchmark task matrix (refactor/bugfix/feature/test), the ablation grid (same model/window/tools; lexical vs. semantic vs. AST vs. LSP), standardized retrieval metrics (precision/recall/coverage), latency and cost reporting, plus a reproducibility plan (released traces, seeds, configs). A short "RQs -> Findings -> Implications" table and a practitioner decision guide (an opinion on when to use lexical vs. semantic vs. AST vs. multi-agent) would also help. Overall: clear and insightful for an exploratory study, but the roadmap to quantitative validation could be more explicit.
    Is the preprint likely to advance academic knowledge? Somewhat likely It offers a useful comparative snapshot across seven agents and retrieval paradigms, surfaces cross-cutting patterns (transparency-vs-efficiency trade-offs, limits of direct LSP use, economy of AST/graph approaches), contributes curated traces, a clear taxonomy, and a concrete roadmap toward a benchmark. Impact is tempered by the exploratory single task/repo design, confounders, and reliance on tokens as a proxy without standardized retrieval metrics or controlled ablations. Minor accuracy issues (e.g., MarsCode conflation) and the need for tighter sourcing/definitions also limit generalizability. Well positioned to stimulate follow-ups and benchmark development; contributions are primarily hypothesis-generating.
    Would it benefit from language editing? No Language is generally clear and does not impede understanding. A light polish would still improve consistency: trim a few long sentences in Introduction/Background, keep terminology/hyphenation consistent ("multi-agent," "agentic," "code retrieval"), define acronyms on first use (e.g., SIMD), resolve any placeholder citations, and ensure stable links in references (e.g., arXiv) with fixed GitHub URLs where possible. These are minor edits for polish, not substantive language fixes.
    Would you recommend this preprint to others? Yes, but it needs to be improved It provides a practical comparison across seven agents, highlights transparency-vs-efficiency patterns, and includes useful execution traces and a clear taxonomy. To strengthen it for broad recommendation: (1) correct and clarify the MarsCode statement (88.3% on SWE-bench Lite's 12 Python repos; language support, not evaluated accuracy); (2) standardize metrics (separate input/output/thoughts tokens, account for cached tokens) and add retrieval precision/recall/coverage, latency, and cost (a minimal sketch of such retrieval metrics follows these responses); (3) include small ablations to reduce confounds (same model/context/tools; toggles for LSP/AST/semantic); (4) improve presentation and accessibility (clean tables; a few conceptual diagrams for Lexical/RAG/CKG/LSP/agentic/multi-agent); (5) tighten wording (define all acronyms on first use; resolve placeholder citations; add stable links; calibrate broad claims with sources or hedging); (6) release reproducibility artifacts (prompts, configs, seeds, logs, scripts to regenerate tables/figures). With these revisions, the paper would be a strong reference point for benchmarked work.
    Is it ready for attention from an editor, publisher or broader audience? Yes, after minor changes The preprint is a solid exploratory contribution with clear taxonomy, useful execution traces, and practical comparisons that merit editorial attention. Before wider circulation, please: (1) correct the MarsCode sentence (88.3% file-localization on SWE-bench Lite's 12 Python repos; language support ≠ evaluated accuracy); (2) standardize token accounting in tables/notes: separate input/output/thoughts and clearly indicate cached vs. non-cached tokens (see the accounting sketch below); (3) add 1–2 concise conceptual figures (e.g., Lexical vs. RAG vs. CKG vs. LSP; agent/sub-agent loop) to aid readers; (4) fix references and links (stable arXiv URLs; working GitHub links), and define remaining acronyms on first use; (5) calibrate broad claims with clear scope ("one task/one repo; hypothesis-generating") and tighten a few phrasings for precision; and (6) add a brief reproducibility note (pointers to prompts/configs/logs, even if partial). These are polish-level edits; deeper ablations/benchmarks can follow in a later version. Overall, this was an engaging and genuinely interesting study to read: timely, insightful, and helpful in mapping today's code-retrieval landscape for coding agents.
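
    To make the standardized-metrics recommendation concrete, here is a minimal, hypothetical sketch (not taken from the preprint) of how retrieval precision, recall/coverage, and F1 over retrieved snippets could be computed against a gold set of relevant files; the function name and the choice of file paths as retrieval units are assumptions.

    ```python
    def retrieval_metrics(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
        """Score an agent's retrieved snippets against a gold set of relevant files.

        retrieved: paths (or snippet ids) the agent pulled into its context.
        relevant:  paths actually needed for the task (e.g., files touched by the reference patch).
        """
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0  # coverage of the gold set
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}

    # Example: three files retrieved, two of them actually relevant.
    print(retrieval_metrics({"a.py", "b.py", "c.py"}, {"a.py", "b.py", "d.py"}))
    # -> precision 0.67, recall 0.67, f1 0.67
    ```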
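
    Similarly, the token-accounting point in (2) could be operationalized with a small normalization step like the sketch below; the per-call fields are assumptions about what an agent trace or provider API might expose, not the study's actual schema.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Turn:
        """Assumed per-call token counts from an agent trace; field names are hypothetical."""
        input_tokens: int        # prompt tokens billed for this call
        cached_tokens: int       # portion of the input served from a prompt cache
        output_tokens: int       # completion tokens
        thought_tokens: int = 0  # reasoning/"thinking" tokens, if reported

    def summarize(turns: list[Turn]) -> dict[str, int]:
        """Aggregate token counts, keeping cached and non-cached input separate."""
        total_input = sum(t.input_tokens for t in turns)
        cached = sum(t.cached_tokens for t in turns)
        return {
            "input_total": total_input,
            "input_uncached": total_input - cached,  # input not served from cache
            "output": sum(t.output_tokens for t in turns),
            "thoughts": sum(t.thought_tokens for t in turns),
        }
    ```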

    Competing interests

    The author declares that they have no competing interests.

    Use of Artificial Intelligence (AI)

    The author declares that they did not use generative AI to come up with new ideas for their review.