Agent Harness for Large Language Model Agents: A Survey
Abstract
Nowadays, the reliability of large language model (LLM) agents in production environments is increasingly determined not by the underlying model but by the agent harness that encapsulates it. As tasks grow longer and more complex, recent studies demonstrate order-of-magnitude reliability gains achieved through harness redesign alone, with the underlying model held fixed. However, the existing literature typically focuses on individual harness components, such as memory, planning, and tool use, in isolation. In this paper, we conduct the first systematic survey of the LLM agent harness, based on a comprehensive review of over a hundred papers, technical reports, and industry blogs. The core contributions of this paper are as follows: (1) A formal definition of the agent harness as a six-component tuple H = (E, T, C, S, L, V), i.e., execution loop, tool registry, context manager, state store, lifecycle hooks, and evaluation interface. (2) A historical account tracing the evolution of the harness concept from software testing and reinforcement learning environments to modern LLM agent systems, identifying a unifying architectural pattern: providing a controllable, observable, and verifiable runtime environment for unpredictable execution agents. (3) An empirically grounded taxonomy of 23 representative systems, classified via a six-component completeness matrix that enables direct cross-framework comparison and reveals that systems mature enough for real-world production deployment consistently implement all six architectural components. (4) A systematic analysis of nine cross-cutting technical challenges spanning sandboxing, evaluation, protocol standardization, and compute economics, including an empirical comparison of tool-level and agent-level interoperability protocols and an assessment of the implications of ultra-long-context models.
(5) A proposal of key future research directions in areas where harness-layer infrastructure remains significantly underdeveloped relative to advances in component capabilities. The latest version of this paper is available at: https://github.com/Gloriaameng/Awesome-Agent-Harness
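As a rough illustration only (not drawn from the paper itself), the six-component tuple H = (E, T, C, S, L, V) might be modeled as a Python dataclass. All field names, types, and the `components` helper below are assumptions made for this sketch, not the survey's formalism:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class AgentHarness:
    """Hypothetical sketch of the six-component harness tuple H = (E, T, C, S, L, V)."""
    execution_loop: Callable[..., Any]        # E: drives the agent's act/observe cycle
    tool_registry: Dict[str, Callable]        # T: named tools the agent may invoke
    context_manager: Any                      # C: assembles and prunes the model context
    state_store: Dict[str, Any] = field(default_factory=dict)   # S: persistent state
    lifecycle_hooks: List[Callable] = field(default_factory=list)  # L: start/step/error callbacks
    evaluation_interface: Callable[..., bool] = lambda *a: True    # V: verifies agent outputs

    def components(self) -> Tuple:
        # Return the formal tuple H = (E, T, C, S, L, V)
        return (self.execution_loop, self.tool_registry, self.context_manager,
                self.state_store, self.lifecycle_hooks, self.evaluation_interface)
```

For example, a harness with a single `echo` tool would be constructed as `AgentHarness(execution_loop=run, tool_registry={"echo": lambda x: x}, context_manager=None)`, and `components()` would yield the full six-tuple for completeness-matrix-style inspection.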