3D-to-4D Gaussian Scene Generation with Text-guided Diffusion
Abstract
The advent of 3D Gaussian Splatting (3DGS) has enabled real-time, photorealistic rendering of static 3D scenes. The next frontier is to instill these static worlds with dynamic, controllable motion, a task central to the future of immersive media and simulation. A promising paradigm for this 4D content creation is to leverage the vast generative power of pre-trained text-to-video diffusion models (VDMs) to create motion priors, which can then be "lifted" into a temporally and spatially consistent 3D scene.

However, early frameworks that implement this paradigm, while conceptually powerful, often prove fragile and limited in practice. This thesis investigates the practical failure modes of this diffusion-lifting approach through a deep analysis of the Gaussians-to-Life (G2L) pipeline. We identify two critical bottlenecks that challenge the scalability and usability of such systems: (1) a restrictive temporal horizon imposed by the underlying VDM's architecture, limiting animations to fleeting, sub-second movements; and (2) a critical disconnect between the text prompt and the final motion, revealing a heavy reliance on manually cherry-picked guidance videos that undermines claims of true text-driven control.

In response, this thesis presents a methodology to enhance the robustness and capability of this paradigm. We demonstrate that replacing the pipeline's original U-Net-based VDM with a modern Diffusion Transformer (DiT), LTX-Video, directly addresses the temporal bottleneck, extending the viable animation horizon from 8 to 64 frames and yielding richer, higher-quality motion. Our work provides a more robust and scalable framework for future diffusion-based 3D-to-4D animation systems, showing a practical path from promising but fragile concepts to more functional and powerful creative tools.
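
For illustration only, the guidance-video stage described above could be driven by LTX-Video through the Hugging Face diffusers library. The sketch below is not the thesis's actual pipeline: the model ID, prompt, resolution, and sampler settings are assumptions, and 65 frames is used because LTX-Video expects frame counts of the form 8k+1, approximating the 64-frame horizon mentioned above.

# Minimal sketch (assumed setup, not the G2L/thesis code): generating a
# long text-conditioned guidance video with the LTX-Video DiT via diffusers.
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Hypothetical scene prompt; in a 3D-to-4D pipeline this would describe the
# desired motion for the rendered static scene.
prompt = "a candle flame flickering gently in a dim room"

video = pipe(
    prompt=prompt,
    negative_prompt="worst quality, inconsistent motion, blurry, jittery",
    width=704,              # assumed resolution; dimensions must be divisible by 32
    height=480,
    num_frames=65,          # 8k+1 frames, i.e. roughly the 64-frame horizon
    num_inference_steps=50,
).frames[0]

export_to_video(video, "guidance_video.mp4", fps=24)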