Build on Priors: Vision-Language-Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation

Abstract

Enabling robots to learn long-horizon manipulation tasks from a handful of demonstrations remains a central challenge in robotics. Existing neuro-symbolic approaches often rely on hand-crafted symbolic abstractions, semantically labeled trajectories, or large demonstration datasets, limiting their scalability and real-world applicability. We present a scalable neuro-symbolic framework that autonomously constructs symbolic planning domains and data-efficient control policies from as few as one to thirty unannotated skill demonstrations, without requiring manual domain engineering. Our method segments demonstrations into skills and employs a Vision-Language Model (VLM) to classify skills and identify equivalent high-level states, enabling the automatic construction of a state-transition graph. This graph is processed by an Answer Set Programming solver to synthesize a PDDL planning domain, which an oracle function exploits to isolate the minimal, task-relevant, target-relative observation and action spaces for each skill policy. Policies are learned at the control-reference level rather than at the raw actuator-signal level, yielding a smoother, less noisy learning target. Known controllers can be leveraged for real-world data augmentation by projecting a single demonstration onto other objects in the scene, simultaneously enriching both the graph construction process and the dataset for imitation learning. We validate our framework primarily on a real industrial forklift across statistically rigorous manipulation trials, and demonstrate cross-platform generality on a Kinova Gen3 robotic arm across two standard benchmarks. Our results show that integrating control-level learning, VLM-driven abstraction, and automated planning synthesis in a unified pipeline constitutes a practical path toward scalable, data-efficient, expert-free, and interpretable neuro-symbolic robotics.
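To make the graph-to-domain step concrete, the following is a minimal Python sketch of one way such a synthesis could look. It is an illustrative assumption, not the paper's implementation: the state encodings, the `transitions` data, and the simple precondition/effect rule are invented here, and the paper itself uses a VLM to establish state equivalence and an Answer Set Programming solver, rather than this direct rule, to synthesize the PDDL domain.

```python
# Hypothetical sketch: turn VLM-labeled skill transitions into a PDDL domain.
# All names and the forklift-style example states are assumptions for
# illustration; the paper performs this synthesis with an ASP solver.

from collections import defaultdict

# Assumed output of the upstream segmentation + VLM stage: each demonstrated
# skill becomes an edge between abstract states, where a state is a
# frozenset of grounded predicates.
transitions = [
    ("pick",  frozenset({"(at pallet loading-bay)", "(forks empty)"}),
              frozenset({"(holding pallet)"})),
    ("place", frozenset({"(holding pallet)"}),
              frozenset({"(at pallet storage)", "(forks empty)"})),
]

def synthesize_domain(transitions, name="demo-domain"):
    """Emit a PDDL domain with one action per observed skill.

    Preconditions are the source state's predicates; effects add the
    target state's predicates and delete source predicates that vanish.
    """
    # Group repeated observations of the same skill label.
    by_skill = defaultdict(list)
    for skill, src, dst in transitions:
        by_skill[skill].append((src, dst))

    actions = []
    for skill, pairs in by_skill.items():
        src, dst = pairs[0]  # single-observation case, for brevity
        pre = " ".join(sorted(src))
        add = sorted(dst - src)
        delete = [f"(not {p})" for p in sorted(src - dst)]
        eff = " ".join(add + delete)
        actions.append(
            f"  (:action {skill}\n"
            f"    :precondition (and {pre})\n"
            f"    :effect (and {eff}))"
        )
    return f"(define (domain {name})\n" + "\n".join(actions) + ")"

print(synthesize_domain(transitions))
```

Printing the result yields a two-action domain whose `pick` and `place` operators chain through the `(holding pallet)` state, which is the kind of structure a classical planner could then use to sequence the learned skill policies.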
