Qwen-Edit+: Scaling Image Editing with VLM-Guided Consistency and Aesthetic Preference Distillation

Abstract

Instruction-based image editing has advanced substantially with the emergence of Diffusion Transformers (DiTs). However, a central challenge remains unresolved: executing complex editing instructions accurately while preserving the structural consistency and visual quality of the source image. Existing methods are limited primarily by three factors: noisy and imbalanced training data, insufficient structural supervision, and inadequate alignment with human aesthetic preferences. To address these issues, we propose Qwen-Edit+, a unified framework for image editing. Specifically, we first introduce Semantic-Consistency Aware Filtering (SCAF) and Distribution-Adaptive Sampling (DAS) to construct high-quality, category-balanced training data. We then propose a VLM-aware Consistency Loss (VCL), which exploits the hierarchical hidden states of Qwen2.5-VL to provide deep semantic and structural supervision. Finally, we incorporate Aesthetic Preference Distillation (APD) to further improve visual harmony and perceptual quality. On Qwen-Consistent-Edit-1.2K, our method achieves a CLIP Score of 0.347, an LPIPS of 0.219, a PSNR of 25.63 dB, and an Aesthetic Score of 6.31, outperforming representative baselines in editability, structural fidelity, and visual quality.
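The abstract does not give the exact formulation of the VCL, but the idea of supervising an edit with a VLM's layer-wise hidden states can be sketched as follows. This is a minimal illustration, assuming a frozen vision-language encoder that exposes per-layer hidden states (e.g., via the Hugging Face `output_hidden_states=True` convention); the helper name `vlm_consistency_loss`, the layer indices, and the cosine distance are illustrative assumptions, not the paper's stated formulation.

```python
import torch
import torch.nn.functional as F


def vlm_consistency_loss(vlm, source_pixels, edited_pixels, layer_ids=(8, 16, 24)):
    """Hedged sketch of a VLM-aware consistency loss (names assumed).

    Both images pass through a frozen VLM vision encoder; the loss
    penalizes divergence between their hidden states at several depths,
    so early layers supervise structure and late layers semantics.
    """
    with torch.no_grad():
        # Reference features from the source image carry no gradient.
        src = vlm(source_pixels, output_hidden_states=True).hidden_states
    # Gradients flow through the edited image back to the editing model.
    edt = vlm(edited_pixels, output_hidden_states=True).hidden_states

    loss = edited_pixels.new_zeros(())
    for i in layer_ids:
        # Cosine distance keeps the penalty comparable across layers.
        loss = loss + (1.0 - F.cosine_similarity(src[i], edt[i], dim=-1)).mean()
    return loss / len(layer_ids)
```

In training, a term like this would be weighted against the base diffusion objective; averaging over several depths is what makes the supervision "hierarchical" in the abstract's sense, rather than matching a single final-layer embedding.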
