AEGIS-RL: Abstract, Explainable Graphs for Integrated Safety in RL

Abstract

Ensuring the safety of reinforcement learning (RL) policies in high-stakes environments requires more than formal verification: it needs interpretability and targeted falsification—the deliberate search for counter-examples that expose potential failures before deployment. We present AEGIS-RL (Abstract, Explainable Graphs for Integrated Safety in RL), a hybrid framework that unifies (1) explainable RL, (2) probabilistic model checking, and (3) risk-guided falsification, and augments them with (4) a lightweight runtime safety shield that switches to a fallback policy when estimated risk exceeds a threshold. AEGIS-RL first builds a directed, semantically meaningful graph from offline trajectories that blends local and global explanations to make policy behavior transparent and verifier-friendly. This abstract graph is fed to a probabilistic model checker (e.g., Storm) to verify temporal safety specifications; when violations exist, the checker returns interpretable counterexample traces that pinpoint how the policy fails. When specifications appear satisfied, AEGIS-RL estimates residual risk during checking to steer falsification toward high-risk, under-explored states, broadening coverage beyond the offline data. Across safety-critical benchmarks, including two MuJoCo tasks and a medical insulin-dosing scenario, AEGIS-RL uncovers significantly more violations than uncertainty- and fuzzing-based baselines and yields a broader, more novel set of failure trajectories. The resulting explanations and counterexamples provide actionable guidance to understand, debug, and repair unsafe policies while enabling runtime mitigation without retraining.
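To make the runtime safety shield concrete, the following Python sketch illustrates the switching behavior described in the abstract: the learned policy acts normally, but control is handed to a fallback policy whenever estimated risk exceeds a threshold. The names used here (ShieldedPolicy, risk_estimator, fallback_policy, threshold) are illustrative assumptions, not identifiers from the AEGIS-RL implementation.

```python
class ShieldedPolicy:
    """Minimal sketch of a threshold-based runtime safety shield (hypothetical API)."""

    def __init__(self, learned_policy, fallback_policy, risk_estimator, threshold=0.1):
        self.learned_policy = learned_policy    # trained RL policy: state -> action
        self.fallback_policy = fallback_policy  # conservative safe controller: state -> action
        self.risk_estimator = risk_estimator    # maps state -> estimated violation risk in [0, 1]
        self.threshold = threshold              # risk level that triggers the switch

    def act(self, state):
        risk = self.risk_estimator(state)
        if risk > self.threshold:
            # Estimated risk too high: defer to the fallback policy instead of retraining.
            return self.fallback_policy(state)
        return self.learned_policy(state)
```

Under this reading, mitigation requires only a risk estimate and a fallback controller at deployment time, which is consistent with the abstract's claim of runtime mitigation without retraining.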
