Graph-Fused Vision-Language-Action Models for Semantically Safe Dual-Robot Control via Control Barrier Functions


Abstract

Deploying dual-arm robots in human-centric environments demands not only dexterous task execution but also strict adherence to common-sense safety constraints. While recent advancements in Vision-Language-Action (VLA) models enable complex policy reasoning from human demonstrations, they typically lack the formal motion safeguards required to prevent semantically unsafe behaviors—such as manipulating liquids above electronics. In this work, we propose a unified framework that integrates a Graph-Fused VLA (GF-VLA) model with a semantic safety filter, enabling task-level reasoning and certified safe execution for dual-robot systems. To generate manipulation strategies, our approach extracts information-theoretic cues from visual inputs to construct temporal scene graphs that capture intricate hand-object interactions. A language-conditioned transformer leverages these graphs to output hierarchical behavior trees, interpretable Cartesian commands, and optimal cross-hand assignments. Concurrently, to ensure execution safety, the system builds a 3D semantic map and utilizes the contextual reasoning capabilities of large language models to identify semantically unsafe spatial relationships and poses. These semantic rules, alongside traditional geometric collision bounds, are rigorously enforced at the continuous control level via a Control Barrier Function (CBF) certification formulation. We evaluate the proposed framework across diverse dual-arm manipulation scenarios, encompassing complex spatial generalizations and practical real-world semantic constraints. Our results demonstrate that fusing information-theoretic scene representations with CBF-based motion safeguards yields highly reliable, human-readable task policies. Ultimately, this approach achieves high execution success rates while guaranteeing safe robot operation well beyond traditional collision avoidance.
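To make the CBF enforcement step concrete, the sketch below shows the standard form of a Control Barrier Function safety filter for a single-integrator system, where a nominal control command is minimally modified so that a barrier function h(x) ≥ 0 (here, staying outside a keep-out ball, standing in for a geometric or semantic constraint region) remains forward-invariant. This is an illustrative, generic CBF-QP construction, not the authors' specific formulation; the function name `cbf_filter` and the single-integrator dynamics are assumptions for the example.

```python
import numpy as np

def cbf_filter(u_nom, x, x_obs, radius, alpha=1.0):
    """Minimally modify u_nom so a single-integrator state x keeps
    h(x) = ||x - x_obs||^2 - radius^2 >= 0 (stay outside a keep-out ball).

    Closed-form solution of the one-constraint CBF quadratic program:
        min ||u - u_nom||^2   s.t.   grad_h(x) . u >= -alpha * h(x)
    """
    h = np.dot(x - x_obs, x - x_obs) - radius ** 2
    grad_h = 2.0 * (x - x_obs)
    # Constraint residual at the nominal control; >= 0 means already safe.
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0.0:
        return u_nom
    # Project the nominal control onto the constraint boundary.
    return u_nom - (slack / (grad_h @ grad_h)) * grad_h
```

When the nominal command already satisfies the barrier condition the filter is transparent; otherwise it returns the closest safe control, which is what allows the certified layer to sit beneath an arbitrary learned policy. In a full dual-arm system one such constraint would be stacked per geometric and semantic rule inside a single QP.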