Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond
Abstract
Agentic artificial intelligence (AI), in the form of multi-agent systems that combine large language models with external tools and autonomous planning, is rapidly transitioning from research labs into high-stakes domains. Existing evaluations emphasise narrow technical metrics such as task success or latency, leaving important sociotechnical dimensions such as human trust, ethical compliance and economic sustainability under-measured. We propose a balanced evaluation framework spanning five axes (capability & efficiency, robustness & adaptability, safety & ethics, human-centred interaction, and economics & sustainability) and introduce novel indicators, including goal-drift scores and harm-reduction indices. Beyond synthesising prior work, we identify gaps in current benchmarks, develop a conceptual diagram to visualise interdependencies, and outline experimental protocols for empirically validating the framework. Case studies from recent industry deployments illustrate that agentic AI can yield 20–60% productivity gains, yet these deployments often omit assessments of fairness, trust and long-term sustainability. We argue that multidimensional evaluation, combining automated metrics with human-in-the-loop scoring and economic analysis, is essential for the responsible adoption of agentic AI.