An Integrated Vision-Audio Architecture for Responsive Humanoid Robot Interaction with Dynamic Attention Switching
Abstract
Human-Robot Interaction (HRI) is a rapidly growing field concerned with enabling socially meaningful communication between humans and robotic systems. Most current robotic platforms rely heavily on either visual or auditory cues, but few achieve seamless integration of both in a dynamic, context-aware manner. Motivated by the need for more natural, human-like interaction, this paper presents the development of an AI-based humanoid Chat Robot designed as a stationary robotic face capable of real-time multimodal interaction. The presented work integrates facial and mouth movement detection using the MediaPipe framework, auditory direction detection using Fast Fourier Transform (FFT) analysis of multi-microphone input, and rule-based voice interaction driven by a dynamic CSV dataset. A core switching logic governs attention shifts between vision and audio based on environmental cues, ensuring robust and adaptive interaction. The robot's 3D design features a natural, human-like facial structure, including servo-controlled eyes, jaw, and neck, which provide expressive motion to reinforce engagement. Evaluation across varied single- and multi-user interaction scenarios demonstrates accurate speaker tracking, reliable audio localization, and smooth servo actuation. The system provides a low-cost, modular platform suitable for HRI research, educational applications, and experimental Social Robotics.
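To make the vision-audio attention switching described above concrete, the sketch below illustrates one plausible realization in Python. It is not the authors' implementation: the function names (`estimate_audio_angle`, `choose_attention`), the microphone spacing, sample rate, and loudness threshold are all assumptions, and the FFT-based direction estimate is shown here as a GCC-PHAT-style time-delay estimate between two microphones, which is only one way the abstract's "FFT on multi-microphone input" could be realized. The rule is deliberately simple: prefer a visible speaker's face, fall back to the dominant sound direction, otherwise hold centre.

```python
import numpy as np

# Assumed hardware and tuning constants (not from the paper).
SOUND_THRESHOLD = 0.01   # hypothetical RMS level above which audio counts as "speech present"
MIC_DISTANCE_M = 0.15    # assumed spacing between the two microphones (metres)
SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16000      # assumed audio sample rate (Hz)


def estimate_audio_angle(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate the horizontal source angle (degrees, 0 = straight ahead) from a
    stereo frame via the phase of the cross-power spectrum (GCC-PHAT-style)."""
    n = len(left)
    L = np.fft.rfft(left, n=2 * n)
    R = np.fft.rfft(right, n=2 * n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase, discard magnitude
    corr = np.fft.irfft(cross)
    # Only lags consistent with the physical mic spacing are meaningful.
    max_lag = int(SAMPLE_RATE * MIC_DISTANCE_M / SPEED_OF_SOUND)
    lags = np.concatenate((corr[-max_lag:], corr[:max_lag + 1]))
    delay = (np.argmax(lags) - max_lag) / SAMPLE_RATE
    sin_theta = np.clip(delay * SPEED_OF_SOUND / MIC_DISTANCE_M, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))


def choose_attention(face_visible: bool, face_angle_deg: float,
                     left: np.ndarray, right: np.ndarray):
    """Rule-based attention switch: a detected face wins, then the loudest
    sound direction, otherwise the head stays centred."""
    rms = np.sqrt(0.5 * (np.mean(left ** 2) + np.mean(right ** 2)))
    if face_visible:
        return "vision", face_angle_deg
    if rms > SOUND_THRESHOLD:
        return "audio", estimate_audio_angle(left, right)
    return "idle", 0.0


if __name__ == "__main__":
    # Synthetic check: the right channel is a delayed copy of the left,
    # mimicking a source nearer the left microphone.
    t = np.arange(0, 0.05, 1.0 / SAMPLE_RATE)
    sig = np.sin(2 * np.pi * 440 * t)
    delay_samples = 3
    left = sig
    right = np.concatenate((np.zeros(delay_samples), sig[:-delay_samples]))
    print(choose_attention(False, 0.0, left, right))
```

In a full system the `face_visible` flag and `face_angle_deg` would come from a MediaPipe face/mouth detector running on the camera stream, and the returned angle would be mapped to the neck servo command; those stages are omitted here to keep the sketch self-contained.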