Dream Your Pose: Robust Human Pose Generation via Uncertainty-Aware Structural Reward Modeling


Abstract

Conditional diffusion models offer a versatile paradigm for controllable image synthesis, yet faithfully adhering to intricate spatial constraints, such as human pose skeletons, remains challenging. Despite progress in control architectures and reward-guided fine-tuning, outputs frequently exhibit structural aberrations, joint misalignments, and anatomically infeasible configurations. We contend that these shortcomings arise primarily from two deficiencies: pixel-based rewards inadequately capture the perceptual and topological nuances of skeletal forms, and reward signals become unreliable on diverse or out-of-distribution samples.

To address this, we introduce Dream Your Pose, a perceptually informed framework for pose-conditioned generation that prioritizes structural fidelity. Our method incorporates a multi-channel, structure-sensitive reward mechanism that uses perceptual features such as local contrast, edge gradients, and spatial continuity to gauge pose congruence more accurately. Critically, we integrate an uncertainty-aware regularization scheme, drawing on principles of uncertainty modeling in learning, that adaptively modulates reward influence, thereby mitigating the effects of spurious or ambiguous feedback and fostering robust training dynamics.

Rigorous evaluations on the OpenPose-ControlNet dataset reveal substantial gains, including a 25.3% relative improvement in Object Keypoint Similarity (OKS) and a 15.6% improvement in Probability of Correct Keypoint at 0.5 (PCK@0.5), underscoring improved keypoint precision and holistic skeletal integrity. These advancements yield images with superior visual coherence and anatomical plausibility, without compromising semantic fidelity or perceptual quality.
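The abstract reports results in OKS and PCK@0.5 but does not say how either metric is computed. As a rough illustration only, the sketch below uses the widely used COCO-style OKS formula and a generic PCK with a caller-supplied reference length. The function names, the `sigmas` values, and the choice of normalization length are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def oks(pred, gt, visible, area, sigmas):
    """COCO-style Object Keypoint Similarity for a single person.

    pred, gt : (K, 2) arrays of keypoint coordinates in pixels.
    visible  : (K,) boolean mask of annotated keypoints.
    area     : object scale squared (e.g. the person's segment or box area).
    sigmas   : (K,) per-keypoint falloff constants (assumed values here).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return 0.0
    d2 = np.sum((pred - gt) ** 2, axis=1)            # squared pixel distances
    k2 = (2.0 * np.asarray(sigmas, dtype=float)) ** 2
    e = d2 / (2.0 * area * k2 + np.finfo(float).eps)
    return float(np.exp(-e)[visible].mean())


def pck(pred, gt, visible, ref_length, alpha=0.5):
    """Fraction of visible keypoints within alpha * ref_length of the target.

    ref_length is an assumed normalizer (e.g. torso or bounding-box size).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return 0.0
    dist = np.linalg.norm(pred - gt, axis=1)
    return float((dist[visible] <= alpha * ref_length).mean())


# Toy example with two keypoints; all numbers are made up for illustration.
gt_kpts = np.array([[0.0, 0.0], [10.0, 0.0]])
pred_kpts = np.array([[0.0, 0.0], [10.0, 8.0]])
mask = np.array([True, True])
toy_sigmas = np.array([0.05, 0.05])  # hypothetical, not the COCO constants

oks_score = oks(pred_kpts, gt_kpts, mask, area=100.0, sigmas=toy_sigmas)
pck_score = pck(pred_kpts, gt_kpts, mask, ref_length=10.0, alpha=0.5)
```

In the toy example one keypoint is exact and the other is 8 pixels off, so with a reference length of 10 and a threshold of 0.5 only one of the two counts as correct under PCK. OKS instead decays smoothly with distance, which is why the two metrics can move by different amounts for the same change in keypoint error.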
