Bridging Perception, Language, and Action: A Survey and Bibliometric Analysis of VLM & VLA Systems


Abstract

Vision-Language Models (VLMs) and Vision-Language-Action (VLA) models mark a pivotal advancement in multimodal AI, integrating perception, language understanding, and physical action to enable embodied intelligence in real-world applications. This survey delivers an in-depth bibliometric analysis of 1,798 publications sourced from Scopus and OpenAlex, merging quantitative computational techniques with qualitative examination to trace the field's evolution, key themes, and persistent challenges. Through keyword frequency assessment, co-occurrence networks, temporal trend mapping, author collaboration visualization, and similarity-based clustering of titles and keywords, we uncover exponential publication growth since 2022—a 50-fold increase—with core themes in language models, visual languages, and action recognition driving unified multimodal architectures. The analysis identifies ten keyword clusters centered on multimodal integration, robotic learning, and foundation models, alongside ten title clusters emphasizing applications ranging from robotic navigation and video understanding to generalist agents and web navigation. Author collaboration maps reveal geographic dominance by U.S. and Chinese institutions, raising concerns about technology governance and safety oversight. The qualitative review of high-impact papers traces VLA progression from closed-source fine-tuning to open-source transfer learning, rigorous grounding evaluations, cultural bias assessments, and deployments in virtual agents, autonomous driving, and robotic manipulation. Despite growing technical maturity, critical gaps persist in safety mechanisms, adversarial robustness, and standardized evaluation, underscoring the need for prioritized research to support responsible deployment.
