Emotion-BIND: Multimodal Emotion Recognition and Reasoning in Conversation


Abstract

Constructing a Multimodal Emotion Recognition in Conversation (MERC) model is crucial for understanding users' emotions. Current methods often align cross-modal features with linear layers while keeping encoders frozen, which can cause feature loss and alignment inaccuracies. Additionally, traditional 1D positional encoding limits the capture of important information in dynamic content such as videos. To tackle these issues, we propose a new approach that integrates cross-modal feature extraction without relying on linear transformations, so that features share a unified vector space; this significantly improves alignment precision and reduces feature loss. For dynamic content, we introduce the m-ROPE technique, which decomposes positional encoding into three dimensions (time, height, and width), enhancing the model's spatial understanding of text, images, and videos. By lowering position ID values for images and videos, we enable better extrapolation to longer sequences during inference. Experimental results demonstrate that our model achieves an unweighted average recall (UAR) of 49.44% and a weighted average recall (WAR) of 71.02% on the DFEM dataset, outperforming other methods. On the MELD dataset, it reaches a WAR of 63.88%, exceeding the second-best model by 4.67%, and it excels at recognizing complex emotions such as fear and surprise.
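To illustrate the position-ID idea behind m-ROPE, the following is a minimal sketch (not the authors' implementation): text tokens receive identical IDs on all three axes, visual tokens are decomposed into temporal, height, and width indices, and each new segment starts after the maximum ID used so far, so a video consumes far fewer position IDs than its raw token count. The function name and segment encoding are illustrative assumptions.

```python
def mrope_position_ids(segments):
    """Build (time, height, width) position IDs for a mixed sequence.

    segments: list of ("text", n_tokens) or ("video", n_frames, h, w).
    Text tokens use identical IDs on all three axes; video patches get
    decomposed (time, height, width) indices. The next segment starts
    at max-used-ID + 1, which keeps IDs small for long videos and
    helps extrapolation to longer sequences at inference.
    """
    ids = []    # list of (t, h, w) triples, one per token
    start = 0   # next free position ID
    for seg in segments:
        if seg[0] == "text":
            n = seg[1]
            ids.extend((start + i, start + i, start + i) for i in range(n))
            start += n
        else:  # "video": frames x height x width grid of patches
            _, frames, h, w = seg
            for t in range(frames):
                for y in range(h):
                    for x in range(w):
                        ids.append((start + t, start + y, start + x))
            # a video advances the ID counter by its largest axis only,
            # not by its total patch count
            start += max(frames, h, w)
    return ids
```

For example, a 2x2x2 video (8 patches) advances the position counter by only 2, so text following the video resumes at a much lower ID than a flat 1D scheme would assign.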
