Qwen-Edit+: Scaling Image Editing with VLM-Guided Consistency and Aesthetic Preference Distillation
Abstract
Instruction-based image editing has advanced substantially with the emergence of Diffusion Transformers (DiTs). However, a central challenge remains unresolved: how to accurately execute complex editing instructions while preserving the structural consistency and visual quality of the source image. Existing methods are primarily limited by three factors: noisy and imbalanced training data, insufficient structural supervision, and inadequate alignment with human aesthetic preferences. To address these issues, we propose Qwen-Edit+, a unified framework for image editing. Specifically, we first introduce Semantic-Consistency Aware Filtering (SCAF) and Distribution-Adaptive Sampling (DAS) to construct high-quality, category-balanced training data. We then propose a VLM-aware Consistency Loss (VCL), which exploits the hierarchical hidden states of Qwen2.5-VL to provide deep semantic and structural supervision. Finally, we incorporate Aesthetic Preference Distillation (APD) to further improve visual harmony and perceptual quality. In comparative experiments on the Qwen-Consistent-Edit-1.2K benchmark, our method achieves a CLIP Score of 0.347, an LPIPS of 0.219, a PSNR of 25.63 dB, and an Aesthetic Score of 6.31, outperforming representative baselines in editability, structural fidelity, and visual quality.
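To make the VCL idea concrete, below is a minimal PyTorch sketch of one plausible form of such a loss: a layer-weighted cosine distance between the hierarchical hidden states a frozen VLM vision tower produces for the edited image and for the ground-truth target. The function name, the uniform layer weights, and the tensor shapes are illustrative assumptions, not the paper's actual implementation; the toy tensors stand in for Qwen2.5-VL features.

import torch
import torch.nn.functional as F

def vlm_consistency_loss(edited_states, target_states, weights=None):
    """Layer-weighted cosine distance over hierarchical VLM hidden states.

    edited_states, target_states: lists of per-layer tensors with shape
    (batch, tokens, dim), assumed to come from a frozen VLM vision tower.
    weights: optional per-layer weights; defaults to uniform (an assumption).
    """
    if weights is None:
        weights = [1.0 / len(edited_states)] * len(edited_states)
    loss = edited_states[0].new_zeros(())
    for w, h_edit, h_tgt in zip(weights, edited_states, target_states):
        # 1 - cosine similarity per token, averaged over tokens and batch;
        # the target features are detached so gradients flow only through
        # the edited branch.
        sim = F.cosine_similarity(h_edit, h_tgt.detach(), dim=-1)
        loss = loss + w * (1.0 - sim).mean()
    return loss

# Toy usage with random features standing in for four VLM layers.
if __name__ == "__main__":
    target = [torch.randn(2, 196, 1024) for _ in range(4)]
    edited = [h + 0.1 * torch.randn_like(h) for h in target]
    print(vlm_consistency_loss(edited, target).item())

Supervising across several layers rather than only the final one is what would give the loss both semantic (deep-layer) and structural (shallow-layer) signal, which matches the abstract's description of "deep semantic and structural supervision."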