SCoPE: Shift-Aware Speaker-Conditioned Priors for Emotion Recognition in Conversations

Burak Can Kaplan
Stefan Wermter

Abstract

In conversations, human emotions are transient, yet they tend to persist across multiple utterances. For example, we rarely switch instantly between contrasting emotions such as happiness and anger. Instead, emotions tend to evolve smoothly, and these patterns are often speaker-specific: some people escalate, while others gradually cool down over time. Furthermore, when emotions do change during a conversation, the change is often driven by contextual factors, such as newly received information or unexpected events. Although progress has been made in Emotion Recognition in Conversations (ERC), most existing approaches still rely heavily on overt evidence and do not sufficiently model these non-apparent factors. Especially in multimodal settings, this makes the models fragile when the signals are noisy (e.g., occluded faces, slang expressions, or microphone noise). To address these limitations, we introduce Speaker-Conditioned Priors over Emotions (SCoPE). First, SCoPE is a lightweight module that uses the emotional history of each speaker to explicitly model a speaker-specific prior for use in subsequent emotion classification. Second, we incorporate emotion shift prediction, a well-established concept in ERC, to guide the model in balancing the priors from SCoPE against multimodal evidence. Finally, we propose a shift-aware fusion mechanism that performs precision-weighted logit integration between multimodal evidence and the speaker prior, forming a Bayesian-inspired product-of-experts formulation. This dynamic fusion allows the model to rely on historical priors when emotions persist and to prioritize multimodal evidence when shifts are likely. Experimental results show that our model outperforms recent state-of-the-art models on the IEMOCAP dataset in multimodal settings.
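The abstract does not spell out the fusion equations, so the following is only a minimal sketch of what a shift-gated product-of-experts combination could look like, not the authors' implementation. Everything named here is an assumption made for illustration: the count-based `speaker_prior_logits` helper (a stand-in for the learned SCoPE module), the `decay` and `smoothing` parameters, the scalar precisions, and the choice to scale the prior's weight by `1 - p_shift`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def softmax(logits):
    return [math.exp(x) for x in log_softmax(logits)]


def speaker_prior_logits(history, num_classes, decay=0.8, smoothing=1.0):
    """Hypothetical stand-in for a speaker prior: recency-weighted label counts.

    `history` holds the speaker's past emotion indices, oldest first.
    The paper's module is learned; this count-based version is an assumption.
    """
    counts = [smoothing] * num_classes
    weight = 1.0
    for label in reversed(history):  # most recent utterance gets weight 1.0
        counts[label] += weight
        weight *= decay
    total = sum(counts)
    return [math.log(c / total) for c in counts]


def shift_aware_fusion(evidence_logits, prior_logits, p_shift,
                       evidence_precision=1.0, prior_precision=1.0):
    """Weighted product of experts, computed in log space.

    Assumed gating rule: the prior's weight shrinks as the predicted
    shift probability `p_shift` grows.
    """
    if len(evidence_logits) != len(prior_logits):
        raise ValueError("evidence and prior must cover the same classes")
    if not 0.0 <= p_shift <= 1.0:
        raise ValueError("p_shift must be a probability")
    w_evidence = evidence_precision
    w_prior = prior_precision * (1.0 - p_shift)
    log_e = log_softmax(evidence_logits)
    log_p = log_softmax(prior_logits)
    fused = [w_evidence * e + w_prior * p for e, p in zip(log_e, log_p)]
    return softmax(fused)  # renormalize the weighted product


EMOTIONS = ["happy", "sad", "neutral", "angry"]
history = [3, 3, 3]                 # speaker was angry in the last three turns
prior = speaker_prior_logits(history, len(EMOTIONS))
evidence = [0.4, 0.0, 0.5, 0.3]     # noisy, nearly flat multimodal logits

persist = shift_aware_fusion(evidence, prior, p_shift=0.1)  # shift unlikely
shift = shift_aware_fusion(evidence, prior, p_shift=1.0)    # shift certain
```

In this toy setup, a low shift probability lets the speaker's history dominate the nearly flat evidence, so `persist` peaks on "angry". With `p_shift=1.0` the prior's weight is zero and `shift` reduces to the softmax of the evidence alone, which peaks on "neutral".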

Version published to 10.21203/rs.3.rs-9065619/v1 on Research Square
Apr 7, 2026

Related articles

Evaluating Early, Late and Hybrid Fusion in Multimodal Emotion Detection with Pretrained Models

This article has 3 authors:
1. Syed Riyas Ahamed
2. Sandip Saha
3. Awani Bhushan
This article has no evaluations. Latest version: Apr 13, 2026

Mapping 99 emotion terms with GPT4 prompting reveals nuanced semantic conceptual structure

This article has 2 authors:
1. Han Ke
2. Eiji Watanabe
This article has no evaluations. Latest version: May 15, 2026

The time course of co-speech gesture production: An MEG study

This article has 3 authors:
1. Kazuki Sekine
2. Reiji Ohkuma
3. Hiroshi Ban
This article has no evaluations. Latest version: May 7, 2026
