A Comparative Study of Large Language Models for Gesture Selection in Virtual Agents
Abstract
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interaction, influencing addressees' engagement, recall, comprehension, and attitudes toward the speaker. Similar effects have been observed in interactions with embodied virtual agents, making the selection and animation of meaningful gestures a key focus in agent design. Automating this gesture selection process, however, remains challenging. Prior approaches range from fully data-driven techniques, which often struggle to produce contextually meaningful gestures, to more manual methods that rely on handcrafted expertise and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models (LLMs) to automate gesture selection. We first illustrate the information on gestures encoded in LLMs, using GPT-4 as a high-quality baseline. Building on this, we evaluate alternative prompting approaches for their ability to select meaningful, contextually appropriate gestures aligned with the co-speech utterance. While GPT-4 demonstrates strong gesture selection performance, its inference latency makes it unsuitable for real-time interaction. To assess feasibility under real-time constraints, we then evaluate two additional models relative to this baseline: the locally deployable Llama3 and a new high-capacity cloud model (GPT-5.2). Our comparative analysis highlights differences in gesture appropriateness and alignment, including GPT-5.2's tendency to over-generate gestures by selecting multiple, inappropriate gestures for a single utterance. In contrast, Llama3 achieves a more favorable balance between gesture quality and inference speed, informing model selection for interactive virtual humans. Finally, we demonstrate how this approach is integrated into a virtual agent system, enabling automated gesture selection and animation during human–agent interaction.