Cross-Modal Bias Transfer in Aligned Video Diffusion Models
Abstract
Video diffusion models integrate visual, temporal, and textual signals, creating potential pathways for cross-modal bias transfer. This paper studies how alignment tuning affects the transmission of social bias between text and visual modalities in video generation. We evaluate 14,200 text-to-video samples using a cross-modal attribution framework that decomposes bias contributions across input modalities. Quantitative analysis reveals that alignment tuning reduces text-conditioned bias by 24.8%, yet increases visually induced bias carryover by 31.5%, particularly in identity-related scenarios. The results demonstrate that alignment tuning redistributes bias across modalities rather than eliminating it, highlighting the need for modality-aware alignment strategies.
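The per-modality decomposition described in the abstract can be sketched as follows; the function name, the component labels, and the numeric scores are illustrative assumptions (not the paper's implementation), with the values chosen only so that the resulting shifts mirror the reported 24.8% reduction in text-conditioned bias and 31.5% increase in visually induced bias.

```python
# Illustrative sketch: a sample's overall bias score is attributed to
# text-conditioned and visually induced components, and the shift in each
# component is compared before and after alignment tuning.
# All names and numbers here are hypothetical.

def relative_change(before, after):
    """Percentage change in an aggregate bias component after tuning."""
    return 100.0 * (after - before) / before

# Hypothetical aggregate component scores before and after alignment tuning,
# picked only so the deltas match the direction and magnitude of the
# reported trend (-24.8% text-conditioned, +31.5% visually induced).
pre  = {"text": 0.500, "visual": 0.300}
post = {"text": 0.376, "visual": 0.3945}

text_shift = relative_change(pre["text"], post["text"])        # negative: reduction
visual_shift = relative_change(pre["visual"], post["visual"])  # positive: increase
```

The point of the sketch is the bookkeeping, not the numbers: a drop in one component alongside a rise in another is exactly the redistribution pattern the abstract describes, which a single aggregate bias score would mask.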