Evaluation and Benchmarking of Generative and Agentic AI Systems: A Comprehensive Survey

Abstract

The rapid emergence of generative and agentic artificial intelligence (AI) has outpaced traditional evaluation practices. While large language models excel on static language benchmarks, real-world deployment demands more than accuracy on curated tasks. Agentic systems use planning, tool invocation, memory and multi-agent collaboration to perform complex workflows. Enterprise adoption therefore hinges on holistic assessments that include cost, latency, reliability, safety and multi-agent coordination. This survey provides a comprehensive taxonomy of evaluation dimensions, reviews existing benchmarks for generative and agentic systems, identifies gaps between laboratory tests and production requirements, and proposes future directions for more realistic, multi-dimensional benchmarking.
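
As a minimal, hypothetical illustration of the kind of holistic assessment the abstract describes, the sketch below aggregates task success together with cost, latency and safety into a single bounded score. The record fields, budgets and weights are assumptions introduced here for illustration only and are not taken from the survey itself.

    # Hypothetical sketch: a multi-dimensional evaluation record for an
    # agentic system, combining accuracy with cost, latency and safety
    # instead of task accuracy alone. All names and weights are assumptions.
    from dataclasses import dataclass

    @dataclass
    class EvalRecord:
        task_success: float      # fraction of tasks completed correctly (0-1)
        cost_usd: float          # total API/tool spend for the run
        latency_s: float         # end-to-end wall-clock time per task
        safety_violations: int   # count of guardrail or policy breaches

    def holistic_score(r: EvalRecord,
                       cost_budget: float = 1.0,
                       latency_budget: float = 30.0) -> float:
        """Combine dimensions into a single score in [0, 1].

        Cost and latency are penalized relative to a budget; any safety
        violation zeroes the score. The 0.6/0.2/0.2 weighting is an
        illustrative assumption, not a recommendation from the survey.
        """
        if r.safety_violations > 0:
            return 0.0
        cost_penalty = min(r.cost_usd / cost_budget, 1.0)
        latency_penalty = min(r.latency_s / latency_budget, 1.0)
        return (0.6 * r.task_success
                + 0.2 * (1 - cost_penalty)
                + 0.2 * (1 - latency_penalty))

    # Example: a run that succeeds on 85% of tasks at $0.40 and 12 s per task.
    print(holistic_score(EvalRecord(0.85, 0.40, 12.0, 0)))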
