Construction and Practical Validation of an Evaluation Framework for General-Purpose Agents
Abstract
Intelligent agent technology represents a pivotal breakthrough in the evolution of artificial intelligence, marking the shift from systems that merely "understand" to those capable of autonomous action. As this technology becomes increasingly central to AI deployment, a scientifically rigorous, standardized evaluation framework is essential for supporting and accelerating the industrialization of AI applications. However, developing such a framework is challenging because intelligent agent tasks are complex and diverse. To address these challenges, this study introduces the YiHeng Agent Evaluation System, a comprehensive, objective, and user-centered framework. It employs a "2-4-2" hierarchical structure comprising two types of evaluation scenarios, four key evaluation elements, and two overarching evaluation dimensions. The system assesses not only functional capabilities, such as usability and effectiveness, but also user experience factors, including ease of use and satisfaction. To validate the proposed framework, empirical evaluations were conducted on eight leading general-purpose intelligent agents from around the world, and purpose-built evaluation tools were engineered to support systematic assessment. The results confirm the framework's effectiveness and practical relevance, yielding actionable insights for improving agent performance and promoting the sustainable advancement of the AI industry.
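To make the two-dimension layer of the "2-4-2" structure concrete, the following is a minimal, purely illustrative Python sketch of how the two overarching dimensions and the metrics the abstract names (usability, effectiveness, ease of use, satisfaction) might be organized and aggregated. The equal-weight averaging scheme and the sample scores are assumptions for illustration, not values or methods from the study.

```python
# Illustrative sketch only: the aggregation scheme and sample scores below
# are assumptions, not the YiHeng system's published methodology.
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str           # e.g. "usability", "satisfaction"
    score: float        # normalized to [0, 1] for aggregation
    weight: float = 1.0 # assumed equal weighting by default


@dataclass
class Dimension:
    name: str                      # one of the two overarching dimensions
    metrics: list[Metric] = field(default_factory=list)

    def score(self) -> float:
        # Weighted mean of the constituent metric scores.
        total = sum(m.weight for m in self.metrics)
        return sum(m.score * m.weight for m in self.metrics) / total if total else 0.0


# The two overarching dimensions, each holding the metrics the abstract names.
functional = Dimension("functional capability", [
    Metric("usability", 0.82), Metric("effectiveness", 0.74),  # sample scores
])
experience = Dimension("user experience", [
    Metric("ease of use", 0.68), Metric("satisfaction", 0.71),  # sample scores
])

for dim in (functional, experience):
    print(f"{dim.name}: {dim.score():.2f}")
```

The scenario-type and evaluation-element layers of the hierarchy would sit above and between these dimensions; the abstract does not enumerate them, so they are omitted here.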