Collaborative Assembly with Dynamic Environment for Human-Robot Interaction via Multi-Modal Large Language Model
Abstract
Human-Robot Collaboration (HRC) holds significant potential but is hindered by real-world complexity, dynamism, and ambiguous human instructions. This paper introduces CADE-HRI, a novel multi-modal HRC system that enables natural and flexible interaction for assembly tasks. CADE-HRI integrates diverse sensor inputs (natural language, gestures, real-time visual perception such as object pose and gaze, and force/torque feedback) and fuses them in a Multi-modal Large Language Model (MM-LLM). The MM-LLM serves as the central intelligence, orchestrating dynamic task planning, autonomous adaptation to anomalies, and intelligent conflict resolution to generate robust robot actions. Our methodology emphasizes system integration and prompt engineering with pre-trained models. Experimental validation, using fictitious data, demonstrates that CADE-HRI significantly outperforms traditional scripted, NLP-Only, and VLM-Adapt baselines in task completion, efficiency, and robustness across complex assembly tasks involving dynamic changes and ambiguous instructions. Human-centric evaluations indicate superior user satisfaction, and ablation studies confirm the synergistic contribution of the multi-modal inputs. This work affirms the efficacy of integrating multi-modal perception with MM-LLM-driven dynamic planning to enhance collaborative robot performance and user experience in complex, unstructured workspaces.
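To make the fusion step concrete, the following minimal Python sketch shows one way per-modality readings could be serialized into a single planning prompt for a pre-trained MM-LLM. All class names, field names, and the prompt template are illustrative assumptions for this sketch, not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class SensorSnapshot:
    # One fused frame of multi-modal input; all field names are illustrative.
    utterance: str                        # transcribed natural-language command
    gesture: str                          # classified gesture label, e.g. "point_left"
    gaze_target: str                      # object the human is currently looking at
    object_poses: Dict[str, Tuple[float, ...]]  # name -> (x, y, z, roll, pitch, yaw)
    wrench: Tuple[float, ...]             # wrist force/torque (fx, fy, fz, tx, ty, tz)

def build_planning_prompt(snap: SensorSnapshot, anomalies: List[str]) -> str:
    # Serialize every modality into one structured prompt so a pre-trained
    # MM-LLM can plan the next action; the template below is an assumption.
    poses = "\n".join(f"  - {n}: pos={p[:3]}, rpy={p[3:]}"
                      for n, p in snap.object_poses.items())
    issues = "\n".join(f"  - {a}" for a in anomalies) or "  - none"
    return ("You plan actions for a collaborative assembly robot.\n"
            f'Human said: "{snap.utterance}"\n'
            f"Gesture: {snap.gesture}; gaze target: {snap.gaze_target}\n"
            f"Wrist force/torque: {snap.wrench}\n"
            f"Visible objects:\n{poses}\n"
            f"Detected anomalies:\n{issues}\n"
            "If the instruction is ambiguous, ask one clarifying question; "
            "otherwise reply with the next primitive action as JSON.")

def plan_next_action(snap: SensorSnapshot, anomalies: List[str],
                     mmllm: Callable[[str], str]) -> str:
    # `mmllm` stands in for any pre-trained multi-modal LLM endpoint.
    return mmllm(build_planning_prompt(snap, anomalies))

if __name__ == "__main__":
    snap = SensorSnapshot(
        utterance="Hand me that bracket",
        gesture="point_left",
        gaze_target="bracket_A",
        object_poses={"bracket_A": (0.42, -0.10, 0.05, 0.0, 0.0, 1.57)},
        wrench=(0.1, 0.0, -2.3, 0.0, 0.01, 0.0),
    )
    # A trivial echo function stands in for the real MM-LLM here.
    print(plan_next_action(snap, ["fixture moved"], lambda p: p[:80] + "..."))

Serializing all modalities into one prompt, as sketched above, lets the gaze and gesture channels disambiguate references such as "that bracket", which is the kind of ambiguity resolution the abstract attributes to the MM-LLM.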