Agent Harness for Large Language Model Agents: A Survey
Abstract
Nowadays, the reliability of large language model (LLM) agents in production environments is increasingly determined not by the underlying model but by the agent harness that encapsulates it. As tasks grow longer and more complex, recent studies demonstrate order-of-magnitude reliability gains achieved through harness redesign alone, with the underlying model held fixed. However, the existing literature typically focuses on individual harness components, such as memory, planning, and tool use, in isolation. In this paper, we conduct the first systematic survey of the LLM agent harness, based on a comprehensive review of over a hundred papers, technical reports, and industry blogs. The core contributions of this paper are as follows: (1) A formal definition of the agent harness as a six-component tuple H = (E, T, C, S, L, V), i.e., execution loop, tool registry, context manager, state store, lifecycle hooks, and evaluation interface. (2) A historical account tracing the evolution of the harness concept from software testing and reinforcement learning environments to modern LLM agent systems, identifying a unifying architectural pattern: providing a controllable, observable, and verifiable runtime environment for unpredictable execution agents. (3) An empirically grounded taxonomy of 23 representative systems, classified via a six-component completeness matrix that enables direct cross-framework comparison and reveals that systems mature enough for real-world production deployment consistently implement all six architectural components. (4) A systematic analysis of nine cross-cutting technical challenges spanning sandboxing, evaluation, protocol standardization, and compute economics, including an empirical comparison of tool-level and agent-level interoperability protocols and an assessment of the implications of ultra-long-context models.
(5) A proposal of key future research directions in areas where harness-layer infrastructure remains significantly underdeveloped relative to advances in component capabilities. The latest version of this paper is available at: https://github.com/Gloriaameng/Awesome-Agent-Harness
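As a rough illustration only (not drawn from the paper itself), the six-component tuple H = (E, T, C, S, L, V) might be modeled as a Python dataclass. All field names, types, and the `components` helper below are assumptions made for this sketch, not the survey's formalism:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class AgentHarness:
    """Hypothetical sketch of the six-component harness tuple H = (E, T, C, S, L, V)."""
    execution_loop: Callable[..., Any]        # E: drives the agent's act/observe cycle
    tool_registry: Dict[str, Callable]        # T: named tools the agent may invoke
    context_manager: Any                      # C: assembles and prunes the model context
    state_store: Dict[str, Any] = field(default_factory=dict)   # S: persistent state
    lifecycle_hooks: List[Callable] = field(default_factory=list)  # L: start/step/error callbacks
    evaluation_interface: Callable[..., bool] = lambda *a: True    # V: verifies agent outputs

    def components(self) -> Tuple:
        # Return the formal tuple H = (E, T, C, S, L, V)
        return (self.execution_loop, self.tool_registry, self.context_manager,
                self.state_store, self.lifecycle_hooks, self.evaluation_interface)
```

For example, a harness with a single `echo` tool would be constructed as `AgentHarness(execution_loop=run, tool_registry={"echo": lambda x: x}, context_manager=None)`, and `components()` would yield the full six-tuple for completeness-matrix-style inspection.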