Vision-Language-Action (VLA) Models for Unmanned Aerial Robotics and Bimanual Manipulation: A Review

Abstract

Vision-Language-Action (VLA) models unify visual perception, natural-language understanding, and action generation within a single foundation model, allowing a robot to follow instructions such as “fold the towel” or “fly to the red building” directly from camera images. Because VLAs inherit world knowledge from internet-scale pre-training, they have become the dominant framework for learning-based manipulation, with bimanual coordination serving as the most demanding testbed: two arms with 7+ degrees of freedom each must move in concert to fold, assemble, and reorient objects. Unmanned aerial robotics faces a structurally similar challenge: a drone must coordinate thrust, attitude, and, increasingly, gripper commands from visual observations under strict latency and payload constraints. This review covers 186 contributions spanning 2017–2026, organized along seven dimensions: VLA architectures, training recipes, action representations, bimanual coordination (2022–2026), unmanned aerial vehicle (UAV) navigation and control (2017–2026), language grounding, and cross-cutting concerns including memory and world models. We show that the coordination strategies, training recipes, and action representations developed for bimanual VLAs transfer to unmanned aerial systems, and we identify fourteen research directions across both domains.
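As a concrete illustration of the interface the abstract describes, the sketch below shows a minimal VLA-style policy that maps a camera image and a language instruction to an action vector. All names here (VLAPolicy, predict_action) are hypothetical placeholders for illustration, not the API of any system covered in the review; a real model would load pretrained vision-language weights rather than returning a stub action.

import numpy as np

class VLAPolicy:
    """Single foundation model: image + instruction in, action out."""

    def __init__(self, action_dim: int):
        # A trained VLA would initialize pretrained vision-language
        # weights here; this stub only records the action dimensionality
        # (e.g. 14 for two 7-DoF arms, or thrust/attitude for a UAV).
        self.action_dim = action_dim

    def predict_action(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Placeholder: a real model would fuse image tokens with the
        # tokenized instruction and decode a continuous action.
        return np.zeros(self.action_dim)

# Usage: the same interface covers both domains surveyed.
arm_policy = VLAPolicy(action_dim=14)  # bimanual: 2 x 7-DoF arms
uav_policy = VLAPolicy(action_dim=4)   # UAV: thrust + roll/pitch/yaw
frame = np.zeros((224, 224, 3), dtype=np.uint8)
action = arm_policy.predict_action(frame, "fold the towel")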
