SCoPE: Shift-Aware Speaker-Conditioned Priors for Emotion Recognition in Conversations

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

In conversations, human emotions are transient; however, they tend to persist across multiple utterances. For example, we rarely switch instantly between contrasting emotions such as happiness and anger. Instead, emotions tend to evolve smoothly, and these patterns are often speaker-specific. Some people might escalate, while others gradually cool down over time. Furthermore, when emotions change during a conversation, they are often driven by contextual factors, such as newly received information or unexpected events. Even though progress has been made in Emotion Recognition in Conversations (ERC), most existing approaches still rely heavily on overt evidence and do not sufficiently model these non-apparent factors. Especially in multimodal settings, this makes these models fragile when the signals are noisy (e.g., occluded faces, slang expressions, or microphone noise). To address these limitations, we introduce Speaker-Conditioned Priors over Emotions (SCoPE). SCoPE is a light weight module that utilizes the emotional history of each speaker and explicitly models their priors for use in subsequent emotion classification. Second, we incorporate emotion shift prediction, a well-established concept in ERC, to guide the model in balancing the priors from SCoPE and multimodal evidence. Finally, we propose a shift-aware fusion mechanism that performs precision-weighted logit integration between multimodal evidence and the speaker prior, forming a Bayesian-inspired product-of-experts formulation. This dynamic fusion allows the model to rely on historical priors when emotions persist and to prioritize multimodal evidence when shifts are likely. Experimental results show our model achieves superior performance over recent state-of-the-art models on the IEMOCAP dataset in multimodal settings.

Article activity feed