HB-Eval: A System-Level Reliability Evaluation and Certification Framework for Agentic AI

Abstract

Current evaluation paradigms for agentic AI focus predominantly on task success rates under nominal conditions, creating a critical blind spot: agents may succeed under ideal circumstances while exhibiting catastrophic failure modes under stress. We propose HB-Eval, a rigorous methodology for measuring behavioral reliability through three complementary metrics: Failure Resilience Rate (FRR), quantifying recovery from systematic fault injection; Planning Efficiency Index (PEI), measuring trajectory optimality against oracle-verified paths; and Traceability Index (TI), evaluating reasoning transparency via a calibrated LLM-as-a-Judge (κ = 0.82 with human consensus). Through systematic evaluation across 500 episodes spanning three strategically selected domains (logistics, healthcare, coding), we demonstrate a 42.9 percentage point reliability gap between nominal success rates and stressed performance for baseline architectures. We introduce an integrated resilience architecture combining Eval-Driven Memory (EDM) for selective experience consolidation, Adaptive Planning for PEI-guided recovery, and Human-Centered Explainability (HCI-EDM) for trust calibration. This closed-loop system achieves 94.2% ± 2.1% FRR with statistically significant improvements over baselines (Cohen's d = 3.28, p < 0.001), establishing a rigorous methodology for transitioning agentic AI from capability demonstrations to reliability-certified deployment. We conclude by proposing a three-tier certification framework and identifying critical research directions for community validation.
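To make the abstract's metrics concrete, the following minimal Python sketch shows one plausible way the three quantities could be computed from per-episode logs. It is illustrative only: the Episode fields and the exact formulations of FRR, PEI, and the nominal-versus-stressed reliability gap are assumptions, not the paper's definitions.

```python
# Illustrative sketch (assumed formulations, not the paper's official definitions).
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    succeeded: bool       # task completed under nominal conditions
    fault_injected: bool  # episode subjected to systematic fault injection
    recovered: bool       # agent recovered after the injected fault
    steps_taken: int      # length of the agent's actual trajectory
    oracle_steps: int     # length of the oracle-verified optimal path


def failure_resilience_rate(episodes: List[Episode]) -> float:
    """FRR: share of fault-injected episodes from which the agent recovered."""
    stressed = [e for e in episodes if e.fault_injected]
    return sum(e.recovered for e in stressed) / len(stressed) if stressed else 0.0


def planning_efficiency_index(episodes: List[Episode]) -> float:
    """PEI: mean ratio of oracle path length to actual trajectory length (1.0 = optimal)."""
    ratios = [e.oracle_steps / e.steps_taken for e in episodes if e.steps_taken > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def reliability_gap(episodes: List[Episode]) -> float:
    """Gap (percentage points) between nominal success rate and stressed recovery rate."""
    nominal = [e for e in episodes if not e.fault_injected]
    nominal_rate = sum(e.succeeded for e in nominal) / len(nominal) if nominal else 0.0
    return 100.0 * (nominal_rate - failure_resilience_rate(episodes))
```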
