CASE: Confusion-Aware Semantic Enhancement for Multi-Object Text-to-Image Generation
Abstract
Diffusion models have significantly improved the visual quality of text-to-image (T2I) synthesis. However, representative models such as Stable Diffusion still suffer from object omission in multi-object scenarios. First, geometric relationships among objects in the embedding space may induce semantic confusion, which hinders the model’s ability to distinguish semantically similar objects. Second, the model often overemphasizes certain objects, which leads to imbalanced attention allocation across multiple objects. To address these issues, we propose a Confusion-Aware Semantic Enhancement (CASE) approach for T2I generation. To mitigate semantic confusion in the embedding space, we design a Confusion-Aware Embedding Decoupling (CAED) mechanism to enhance the semantic separability of geometrically proximate objects. By explicitly enlarging inter-object embedding distances, CAED strengthens the model’s ability to capture the structural semantics of multi-object prompts. To address imbalanced attention allocation in multi-object scenarios, we propose a Confusion-Aware Attention Separation (CAAS) mechanism that enhances object discriminability during denoising and encourages more stable attention distributions. Extensive experiments on multiple T2I benchmarks and different versions of Stable Diffusion demonstrate consistent improvements across a wide range of evaluation metrics.
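The abstract does not give the paper's actual formulation, but the core idea of CAED — enlarging distances between geometrically proximate object embeddings — can be illustrated with a minimal sketch. The function name, the nearest-neighbour repulsion rule, and the `step` parameter below are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def decouple_embeddings(emb: np.ndarray, step: float = 0.1) -> np.ndarray:
    """Hypothetical sketch of embedding decoupling: nudge each object's
    text embedding away from its nearest (most confusable) neighbour,
    enlarging inter-object distances in the embedding space.

    emb: (n_objects, dim) array of per-object token embeddings.
    step: size of the repulsive update (illustrative hyperparameter).
    """
    emb = np.asarray(emb, dtype=float)
    out = emb.copy()
    for i in range(len(emb)):
        # Find the geometrically closest other embedding.
        dists = np.linalg.norm(emb - emb[i], axis=1)
        dists[i] = np.inf
        j = int(np.argmin(dists))
        # Push embedding i directly away from that neighbour.
        direction = emb[i] - emb[j]
        norm = np.linalg.norm(direction)
        if norm > 0:
            out[i] = emb[i] + step * direction / norm
    return out
```

Under this toy rule, two semantically similar objects whose embeddings start close together end up farther apart, while well-separated objects are barely affected — the property the abstract attributes to CAED.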