Transformer-based Visual Expression Identification with Recurrent Neural Network

Abstract

Creating sophisticated machine learning models to comprehend interactions between individuals can lead to more intuitive user experiences for interactive systems like Amazon Alexa. Beyond basic indicators such as voice modulation and eye movement, a person's combined audio-visual expressions—including vocal intonation and facial gestures—act as subtle cues reflecting the level of engagement in a conversation. This research explores advanced deep learning techniques for the detection of user expressions through audio-visual data. Initially, we develop a foundational audio-visual model incorporating recurrent neural network layers, which demonstrates performance on par with existing leading methods. Subsequently, we introduce a novel transformer-based framework equipped with encoder layers that more effectively fuse audio and visual features for tracking expressions. Evaluation using the Aff-Wild2 dataset reveals that our proposed transformer models outperform the recurrent-based baseline by approximately 2% in accurately identifying arousal and valence metrics. Additionally, our multimodal transformer approaches exhibit notable enhancements compared to unimodal models, achieving performance improvements of up to 3.6%. Comprehensive ablation analyses confirm the crucial role of visual information in the accurate detection of expressions within the Aff-Wild2 dataset. These findings underscore the potential of transformer architectures in advancing the field of expression recognition and enhancing human-computer interaction systems.
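The abstract describes fusing audio and visual features with transformer encoder layers, but gives no implementation details. As an illustrative sketch only (not the paper's actual architecture), early fusion can be done by concatenating the two modalities' token sequences and letting a single self-attention layer mix information across them; all names and shapes below are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Scaled dot-product self-attention over a token sequence x of shape (T, d).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def fuse_audio_visual(audio, visual, Wq, Wk, Wv):
    # Early fusion: concatenate audio and visual token sequences so that
    # attention can attend across modalities (a simplification of the
    # encoder-layer fusion the abstract refers to).
    tokens = np.concatenate([audio, visual], axis=0)  # (Ta + Tv, d)
    return self_attention(tokens, Wq, Wk, Wv)

rng = np.random.default_rng(0)
d = 16
audio = rng.normal(size=(10, d))   # 10 audio frames, hypothetical embeddings
visual = rng.normal(size=(8, d))   # 8 video frames, hypothetical embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = fuse_audio_visual(audio, visual, Wq, Wk, Wv)
print(fused.shape)  # (18, 16): one fused representation per input token
```

In practice a full model would stack several such encoder layers with multi-head attention, feed-forward sublayers, and positional encodings, then pool the fused tokens into a regression head for valence and arousal; the sketch above only shows the cross-modal mixing step.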
