Collaborative Assembly with Dynamic Environment for Human-Robot Interaction via Multi-Modal Large Language Model
Abstract
Human-Robot Collaboration (HRC) holds significant potential but is hindered by real-world complexity, dynamism, and ambiguous human instructions. This paper introduces CADE-HRI, a novel multi-modal HRC system that enables natural and flexible interaction for assembly tasks. CADE-HRI integrates diverse sensor inputs (natural language, gestures, real-time visual perception such as object pose and gaze, and force/torque feedback) and fuses them in a Multi-modal Large Language Model (MM-LLM). The MM-LLM serves as the central intelligence, orchestrating dynamic task planning, autonomous adaptation to anomalies, and intelligent conflict resolution to generate robust robot actions. Our methodology emphasizes system integration and prompt engineering with pre-trained models. Experimental validation, using fictitious data, demonstrates that CADE-HRI significantly outperforms traditional scripted, NLP-Only, and VLM-Adapt baselines in task completion, efficiency, and robustness across complex assembly tasks involving dynamic changes and ambiguous instructions. Human-centric evaluations indicate superior user satisfaction, and ablation studies confirm the synergistic contribution of the multi-modal inputs. This work affirms the efficacy of integrating multi-modal perception with MM-LLM-driven dynamic planning to enhance collaborative robot performance and user experience in complex, unstructured workspaces.
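To make the fusion step concrete, the following minimal Python sketch shows one way per-modality readings could be serialized into a single planning prompt for a pre-trained MM-LLM. All class names, field names, and the prompt template are illustrative assumptions for this sketch, not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class SensorSnapshot:
    # One fused frame of multi-modal input; all field names are illustrative.
    utterance: str                        # transcribed natural-language command
    gesture: str                          # classified gesture label, e.g. "point_left"
    gaze_target: str                      # object the human is currently looking at
    object_poses: Dict[str, Tuple[float, ...]]  # name -> (x, y, z, roll, pitch, yaw)
    wrench: Tuple[float, ...]             # wrist force/torque (fx, fy, fz, tx, ty, tz)

def build_planning_prompt(snap: SensorSnapshot, anomalies: List[str]) -> str:
    # Serialize every modality into one structured prompt so a pre-trained
    # MM-LLM can plan the next action; the template below is an assumption.
    poses = "\n".join(f"  - {n}: pos={p[:3]}, rpy={p[3:]}"
                      for n, p in snap.object_poses.items())
    issues = "\n".join(f"  - {a}" for a in anomalies) or "  - none"
    return ("You plan actions for a collaborative assembly robot.\n"
            f'Human said: "{snap.utterance}"\n'
            f"Gesture: {snap.gesture}; gaze target: {snap.gaze_target}\n"
            f"Wrist force/torque: {snap.wrench}\n"
            f"Visible objects:\n{poses}\n"
            f"Detected anomalies:\n{issues}\n"
            "If the instruction is ambiguous, ask one clarifying question; "
            "otherwise reply with the next primitive action as JSON.")

def plan_next_action(snap: SensorSnapshot, anomalies: List[str],
                     mmllm: Callable[[str], str]) -> str:
    # `mmllm` stands in for any pre-trained multi-modal LLM endpoint.
    return mmllm(build_planning_prompt(snap, anomalies))

if __name__ == "__main__":
    snap = SensorSnapshot(
        utterance="Hand me that bracket",
        gesture="point_left",
        gaze_target="bracket_A",
        object_poses={"bracket_A": (0.42, -0.10, 0.05, 0.0, 0.0, 1.57)},
        wrench=(0.1, 0.0, -2.3, 0.0, 0.01, 0.0),
    )
    # A trivial echo function stands in for the real MM-LLM here.
    print(plan_next_action(snap, ["fixture moved"], lambda p: p[:80] + "..."))

Serializing all modalities into one prompt, as sketched above, lets the gaze and gesture channels disambiguate references such as "that bracket", which is the kind of ambiguity resolution the abstract attributes to the MM-LLM.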