A Comparative Study of Large Language Models for Gesture Selection in Virtual Agents
Abstract
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interaction, influencing addressees' engagement, recall, comprehension, and attitudes toward the speaker. Similar effects have been observed in interactions with embodied virtual agents, making the selection and animation of meaningful gestures a key focus in agent design. Automating this gesture selection process, however, remains challenging. Prior approaches range from fully data-driven techniques, which often struggle to produce contextually meaningful gestures, to more manual methods that rely on handcrafted expertise and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models (LLMs) to automate gesture selection. We first illustrate the information on gestures encoded in LLMs, using GPT-4 as a high-quality baseline. Building on this, we evaluate alternative prompting approaches for their ability to select meaningful, contextually appropriate gestures aligned with the co-speech utterance. While GPT-4 demonstrates strong gesture selection performance, its inference latency makes it unsuitable for real-time interaction. To assess feasibility under real-time constraints, we then evaluate two additional models relative to this baseline: the locally deployable Llama3 and a new high-capacity cloud model (GPT-5.2). Our comparative analysis highlights differences in gesture appropriateness and alignment, including GPT-5.2's tendency to over-generate gestures by selecting multiple, inappropriate gestures for a single utterance. In contrast, Llama3 achieves a more favorable balance between gesture quality and inference speed, informing model selection for interactive virtual humans. Finally, we demonstrate how this approach is integrated into a virtual agent system, enabling automated gesture selection and animation during human–agent interaction.