GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation

Abstract

Apparent personality analysis from short videos poses significant challenges due to the complex interplay of visual, auditory, and textual cues. In this paper, we propose GAME, a Graph-Augmented Multimodal Encoder designed to robustly model and fuse multi-source features for automatic personality prediction. For the visual stream, we construct a facial graph and introduce a dual-branch Geo Two-Stream Network, which combines Graph Convolutional Networks (GCNs) and Convolutional Neural Networks (CNNs) with attention mechanisms to capture both structural and appearance-based facial cues. Complementing this, global context and identity features are extracted using pretrained ResNet18 and VGGFace backbones. To capture temporal dynamics, frame-level features are processed by a BiGRU enhanced with temporal attention modules. Meanwhile, audio representations are derived from the VGGish network, and linguistic semantics are captured via the XLM-Roberta transformer. To achieve effective multimodal integration, we propose a Channel Attention-based Fusion module, followed by a Multi-Layer Perceptron (MLP) regression head for predicting personality traits. Extensive experiments show that GAME consistently outperforms existing methods across multiple benchmarks, validating its effectiveness and generalizability.
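To make the fusion stage more concrete, the following is a minimal PyTorch sketch of a channel-attention fusion module feeding an MLP regression head over the five personality traits. It is illustrative only: the feature widths (512 for the visual stream, 128 for VGGish audio, 768 for XLM-RoBERTa text), the hidden size, the squeeze-and-excitation style gating, and the sigmoid output scaling are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Illustrative channel-attention fusion over per-modality features,
    followed by an MLP head that regresses five personality traits."""
    def __init__(self, dims, hidden=256, reduction=4):
        super().__init__()
        # Project each modality (visual, audio, text) to a common width.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        n = len(dims) * hidden
        # Squeeze-and-excitation style gate over the concatenated channels.
        self.gate = nn.Sequential(
            nn.Linear(n, n // reduction), nn.ReLU(),
            nn.Linear(n // reduction, n), nn.Sigmoid(),
        )
        # MLP regression head predicting the five traits.
        self.head = nn.Sequential(nn.Linear(n, hidden), nn.ReLU(), nn.Linear(hidden, 5))

    def forward(self, feats):
        # feats: list of per-modality feature tensors, each of shape (batch, dim).
        z = torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=-1)
        z = z * self.gate(z)                  # channel-wise reweighting
        return torch.sigmoid(self.head(z))    # trait scores in [0, 1]

# Toy usage with assumed feature widths for the three streams.
model = ChannelAttentionFusion(dims=[512, 128, 768])
visual, audio, text = torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 768)
print(model([visual, audio, text]).shape)     # torch.Size([4, 5])
```

The gating step rescales each concatenated channel before regression, which is one common way to let the model emphasize the more informative modality per sample; the actual GAME fusion module may differ in structure and dimensionality.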
