AEGIS-RL: Abstract, Explainable Graphs for Integrated Safety in RL

Abstract

Ensuring the safety of reinforcement learning (RL) policies in high-stakes environments requires more than formal verification: it needs interpretability and targeted falsification—the deliberate search for counter-examples that expose potential failures before deployment. We present AEGIS-RL (Abstract, Explainable Graphs for Integrated Safety in RL), a hybrid framework that unifies (1) explainable RL, (2) probabilistic model checking, and (3) risk-guided falsification, and augments them with (4) a lightweight runtime safety shield that switches to a fallback policy when estimated risk exceeds a threshold. AEGIS-RL first builds a directed, semantically meaningful graph from offline trajectories that blends local and global explanations to make policy behavior transparent and verifier-friendly. This abstract graph is fed to a probabilistic model checker (e.g., Storm) to verify temporal safety specifications; when violations exist, the checker returns interpretable counterexample traces that pinpoint how the policy fails. When specifications appear satisfied, AEGIS-RL estimates residual risk during checking to steer falsification toward high-risk, under-explored states, broadening coverage beyond the offline data. Across safety-critical benchmarks, including two MuJoCo tasks and a medical insulin-dosing scenario, AEGIS-RL uncovers significantly more violations than uncertainty- and fuzzing-based baselines and yields a broader, more novel set of failure trajectories. The resulting explanations and counterexamples provide actionable guidance to understand, debug, and repair unsafe policies while enabling runtime mitigation without retraining.
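To make the runtime safety shield concrete, the following Python sketch illustrates the switching behavior described in the abstract: the learned policy acts normally, but control is handed to a fallback policy whenever estimated risk exceeds a threshold. The names used here (ShieldedPolicy, risk_estimator, fallback_policy, threshold) are illustrative assumptions, not identifiers from the AEGIS-RL implementation.

```python
class ShieldedPolicy:
    """Minimal sketch of a threshold-based runtime safety shield (hypothetical API)."""

    def __init__(self, learned_policy, fallback_policy, risk_estimator, threshold=0.1):
        self.learned_policy = learned_policy    # trained RL policy: state -> action
        self.fallback_policy = fallback_policy  # conservative safe controller: state -> action
        self.risk_estimator = risk_estimator    # maps state -> estimated violation risk in [0, 1]
        self.threshold = threshold              # risk level that triggers the switch

    def act(self, state):
        risk = self.risk_estimator(state)
        if risk > self.threshold:
            # Estimated risk too high: defer to the fallback policy instead of retraining.
            return self.fallback_policy(state)
        return self.learned_policy(state)
```

Under this reading, mitigation requires only a risk estimate and a fallback controller at deployment time, which is consistent with the abstract's claim of runtime mitigation without retraining.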
