Large-Scale Model-Enhanced Vision-Language Navigation: Recent Advances, Practical Applications, and Future Challenges

Abstract

The ability to autonomously navigate and explore complex 3D environments in a purposeful manner, while integrating visual perception with natural language interaction in a human-like way, represents a longstanding research objective in Artificial Intelligence (AI) and embodied cognition. Vision-Language Navigation (VLN) has evolved from geometry-driven to semantics-driven and, more recently, knowledge-driven approaches. With the introduction of Large Language Models (LLMs) and Vision-Language Models (VLMs), recent methods have achieved substantial improvements in instruction interpretation, cross-modal alignment, and reasoning-based planning. However, existing surveys primarily focus on traditional VLN settings and offer limited coverage of LLM-based VLN, particularly in relation to Sim2Real transfer and edge-oriented deployment. This paper presents a structured review of LLM-enabled VLN, covering four core components: instruction understanding, environment perception, high-level planning, and low-level control. Edge deployment and implementation requirements, datasets, and evaluation protocols are summarized, along with an analysis of task evolution from path-following to goal-oriented and demand-driven navigation. Key challenges, including reasoning complexity, spatial cognition, real-time efficiency, robustness, and Sim2Real adaptation, are examined. Future research directions, such as knowledge-enhanced navigation, multimodal integration, and world-model-based frameworks, are discussed. Overall, LLM-driven VLN is progressing toward deeper cognitive integration, supporting the development of more explainable, generalizable, and deployable embodied navigation systems.
