Semantic-Augmented Reality: A Hybrid Robotic Framework Combining Edge AI and Vision Language Models for Dynamic Industrial Inspection
Abstract
With the rise of Industry 4.0, Augmented Reality (AR) has become pivotal for human-robot collaboration. However, most industrial AR systems still rely on pre-defined tracked images or fiducial markers, which limits their adaptability in unmodeled or dynamic environments. This paper proposes a novel Interactive Semantic-Augmented Reality (ISAR) framework that combines Edge AI with Cloud Vision-Language Models (VLMs). To ensure real-time performance, we implement a Dual-Thread Asynchronous Architecture on the robotic edge, decoupling video streaming from AI inference. We introduce a Confidence-Based Triggering Mechanism in which the cloud-based VLM is invoked only when edge detection confidence falls below a predefined threshold. Instead of traditional image cropping, we employ a Visual Prompting strategy that overlays bounding boxes on full-frame images, preserving the spatial context needed for accurate VLM semantic analysis. Finally, the generated insights are anchored to the physical world via Screen-to-World Raycasting, without fiducial markers. The framework realizes a semantic-aware 'Intelligent Agent' that enhances Human-in-the-Loop (HITL) decision-making in complex industrial settings.
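To make the Dual-Thread Asynchronous Architecture concrete, the sketch below shows one common way to decouple video streaming from AI inference: a single-slot queue shared between a capture thread and an inference thread, so the camera never stalls and the detector always sees the freshest frame. This is a minimal illustration assuming OpenCV for capture, not the authors' implementation; the function names and the stand-in detector are ours.

```python
import queue
import threading

import cv2

# Single-slot queue: streaming never blocks on inference, and inference
# always consumes the newest frame rather than a backlog of stale ones.
latest_frame: queue.Queue = queue.Queue(maxsize=1)

def stream_loop(camera_index: int = 0) -> None:
    """Producer: capture and publish frames at full camera rate."""
    cap = cv2.VideoCapture(camera_index)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        try:
            latest_frame.get_nowait()  # discard the stale frame, if any
        except queue.Empty:
            pass
        latest_frame.put(frame)
    cap.release()

def infer_loop(detect) -> None:
    """Consumer: run edge inference at its own pace, decoupled from capture."""
    while True:
        frame = latest_frame.get()  # blocks until a fresh frame is available
        boxes, scores = detect(frame)
        # ...hand low-confidence detections to the triggering logic below...

if __name__ == "__main__":
    def dummy_detect(frame):
        return [], []  # stand-in for the edge detector (e.g. a YOLO-style model)

    threading.Thread(target=stream_loop, daemon=True).start()
    infer_loop(dummy_detect)
```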
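The Confidence-Based Triggering Mechanism and the Visual Prompting strategy can be illustrated together. In the hedged sketch below, a cloud call happens only when the edge score falls under a threshold, and the prompt image is the full frame with the bounding box drawn on it rather than a crop, so the VLM retains surrounding spatial context. The threshold value, the prompt text, and the `query_vlm` callback are all illustrative placeholders for whatever VLM client a deployment uses.

```python
import cv2
import numpy as np

CONFIDENCE_THRESHOLD = 0.6  # illustrative value; the real threshold is a tuning choice

def maybe_escalate(frame: np.ndarray, box, score: float, query_vlm):
    """Invoke the cloud VLM only for low-confidence edge detections,
    sending the full frame with the box overlaid instead of a crop."""
    if score >= CONFIDENCE_THRESHOLD:
        return None  # edge result is trusted; skip the cloud round-trip

    x1, y1, x2, y2 = box
    prompt_image = frame.copy()
    # Overlay the bounding box on the *full* frame (the visual-prompting step),
    # preserving the spatial context the VLM needs for semantic analysis.
    cv2.rectangle(prompt_image, (x1, y1), (x2, y2), (0, 0, 255), 2)
    ok, jpeg = cv2.imencode(".jpg", prompt_image)
    if not ok:
        return None
    return query_vlm(
        jpeg.tobytes(),
        "Identify the component inside the red box and describe any visible defect.",
    )
```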
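Finally, Screen-to-World Raycasting for markerless anchoring amounts to back-projecting a pixel through the camera model. A minimal sketch under standard pinhole assumptions follows; `K` (intrinsics), `T_world_cam` (camera-to-world pose, e.g. from the AR tracker's SLAM), and the metric `depth` are assumed inputs, and this is a generic formulation rather than the paper's exact procedure.

```python
import numpy as np

def screen_to_world(u: float, v: float, depth: float,
                    K: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) at the given depth into world coordinates.

    K           -- 3x3 pinhole intrinsics matrix
    T_world_cam -- 4x4 camera-to-world pose
    depth       -- metric depth along the camera z-axis at (u, v)
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera-frame ray, z == 1
    p_cam = ray * depth                              # 3D point in the camera frame
    p_world = T_world_cam @ np.append(p_cam, 1.0)    # homogeneous transform to world
    return p_world[:3]

# Example with made-up intrinsics: anchor the pixel (350, 260) seen 1.2 m away.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
anchor = screen_to_world(350, 260, 1.2, K, np.eye(4))
```

The returned point is where the VLM-generated insight would be pinned in world space, with no fiducial marker required.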