CASE: Confusion-Aware Semantic Enhancement for Multi-Object Text-to-Image Generation
Abstract
Diffusion models have significantly improved the visual quality of text-to-image (T2I) synthesis. However, representative models such as Stable Diffusion still suffer from object omission in multi-object scenarios. First, geometric relationships among objects in the embedding space may induce semantic confusion, which hinders the model’s ability to distinguish semantically similar objects. Second, the model often overemphasizes certain objects, which leads to imbalanced attention allocation across multiple objects. To address these issues, we propose a Confusion-Aware Semantic Enhancement (CASE) approach for T2I generation. To mitigate semantic confusion in the embedding space, we design a Confusion-Aware Embedding Decoupling (CAED) mechanism to enhance the semantic separability of geometrically proximate objects. By explicitly enlarging inter-object embedding distances, CAED strengthens the model’s ability to capture the structural semantics of multi-object prompts. To address imbalanced attention allocation in multi-object scenarios, we propose a Confusion-Aware Attention Separation (CAAS) mechanism that enhances object discriminability during denoising and encourages more stable attention distributions. Extensive experiments on multiple T2I benchmarks and different versions of Stable Diffusion demonstrate consistent improvements across a wide range of evaluation metrics.
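The abstract does not give the paper's actual formulation, but the core idea of CAED — enlarging distances between geometrically proximate object embeddings — can be illustrated with a minimal sketch. The function name, the nearest-neighbour repulsion rule, and the `step` parameter below are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def decouple_embeddings(emb: np.ndarray, step: float = 0.1) -> np.ndarray:
    """Hypothetical sketch of embedding decoupling: nudge each object's
    text embedding away from its nearest (most confusable) neighbour,
    enlarging inter-object distances in the embedding space.

    emb: (n_objects, dim) array of per-object token embeddings.
    step: size of the repulsive update (illustrative hyperparameter).
    """
    emb = np.asarray(emb, dtype=float)
    out = emb.copy()
    for i in range(len(emb)):
        # Find the geometrically closest other embedding.
        dists = np.linalg.norm(emb - emb[i], axis=1)
        dists[i] = np.inf
        j = int(np.argmin(dists))
        # Push embedding i directly away from that neighbour.
        direction = emb[i] - emb[j]
        norm = np.linalg.norm(direction)
        if norm > 0:
            out[i] = emb[i] + step * direction / norm
    return out
```

Under this toy rule, two semantically similar objects whose embeddings start close together end up farther apart, while well-separated objects are barely affected — the property the abstract attributes to CAED.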