Graph-Fused Vision-Language-Action Models for Semantically Safe Dual-Robot Control via Control Barrier Functions


Abstract

Deploying dual-arm robots in human-centric environments demands not only dexterous task execution but also strict adherence to common-sense safety constraints. While recent advancements in Vision-Language-Action (VLA) models enable complex policy reasoning from human demonstrations, they typically lack the formal motion safeguards required to prevent semantically unsafe behaviors—such as manipulating liquids above electronics. In this work, we propose a unified framework that integrates a Graph-Fused VLA (GF-VLA) model with a semantic safety filter, enabling task-level reasoning and certified safe execution for dual-robot systems. To generate manipulation strategies, our approach extracts information-theoretic cues from visual inputs to construct temporal scene graphs that capture intricate hand-object interactions. A language-conditioned transformer leverages these graphs to output hierarchical behavior trees, interpretable Cartesian commands, and optimal cross-hand assignments. Concurrently, to ensure execution safety, the system builds a 3D semantic map and utilizes the contextual reasoning capabilities of large language models to identify semantically unsafe spatial relationships and poses. These semantic rules, alongside traditional geometric collision bounds, are rigorously enforced at the continuous control level via a Control Barrier Function (CBF) certification formulation. We evaluate the proposed framework across diverse dual-arm manipulation scenarios, encompassing complex spatial generalizations and practical real-world semantic constraints. Our results demonstrate that fusing information-theoretic scene representations with CBF-based motion safeguards yields highly reliable, human-readable task policies. Ultimately, this approach achieves high execution success rates while guaranteeing safe robot operation well beyond traditional collision avoidance.
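To make the CBF enforcement step concrete, the sketch below shows the standard form of a Control Barrier Function safety filter for a single-integrator system, where a nominal control command is minimally modified so that a barrier function h(x) ≥ 0 (here, staying outside a keep-out ball, standing in for a geometric or semantic constraint region) remains forward-invariant. This is an illustrative, generic CBF-QP construction, not the authors' specific formulation; the function name `cbf_filter` and the single-integrator dynamics are assumptions for the example.

```python
import numpy as np

def cbf_filter(u_nom, x, x_obs, radius, alpha=1.0):
    """Minimally modify u_nom so a single-integrator state x keeps
    h(x) = ||x - x_obs||^2 - radius^2 >= 0 (stay outside a keep-out ball).

    Closed-form solution of the one-constraint CBF quadratic program:
        min ||u - u_nom||^2   s.t.   grad_h(x) . u >= -alpha * h(x)
    """
    h = np.dot(x - x_obs, x - x_obs) - radius ** 2
    grad_h = 2.0 * (x - x_obs)
    # Constraint residual at the nominal control; >= 0 means already safe.
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0.0:
        return u_nom
    # Project the nominal control onto the constraint boundary.
    return u_nom - (slack / (grad_h @ grad_h)) * grad_h
```

When the nominal command already satisfies the barrier condition the filter is transparent; otherwise it returns the closest safe control, which is what allows the certified layer to sit beneath an arbitrary learned policy. In a full dual-arm system one such constraint would be stacked per geometric and semantic rule inside a single QP.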