TTV-HRM: Hierarchical Reasoning Architecture for Efficient Text-to-Video Generation

Abstract

Text-to-video generation is largely inaccessible to most researchers due to its reliance on large-scale models and multi-GPU infrastructure. We introduce the Text-to-Video Hierarchical Reasoning Model (TTV-HRM), a lightweight framework that enables coherent text-conditioned video synthesis on a single commodity GPU. The model employs interleaved hierarchical reasoning in which a high-level transformer captures global semantic structure while a low-level layer refines spatiotemporal details through bidirectional cross-attention. A learned convergence predictor enables early stopping, reducing average inference iterations from 3.0 to 2.1 without quality loss. The 115M-parameter system integrates rotary positional embeddings, SwiGLU feed-forward layers, and a 3D convolutional video autoencoder; it trains in about four hours on a single NVIDIA T4 GPU at roughly $2 of cloud cost, with sub-second inference per clip. On 8-frame 32×32 video generation, TTV-HRM improves frame-wise Fréchet Inception Distance from 120.5 to 62.1 across three epochs using only 45 video–text pairs. Results demonstrate semantic alignment, temporal coherence, and object persistence, showing that hierarchical reasoning can substitute for model scale to make text-to-video research more accessible.
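To make the abstract's two central mechanisms concrete, the PyTorch sketch below illustrates one plausible reading of the interleaved hierarchical loop: a low-level state refined by cross-attending to a high-level state, the high-level state updated by cross-attending back, and a learned convergence predictor that halts iteration early. This is a minimal sketch of the general technique, not the paper's implementation; the module names, dimensions, pooling scheme, and the 0.5 stopping threshold are all assumptions.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Residual cross-attention: queries in `x` attend to `context`."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(self.norm(x), context, context)
        return x + out


class HierarchicalReasoner(nn.Module):
    """Interleaved reasoning loop: a high-level (semantic) state and a
    low-level (spatiotemporal) state exchange information through
    bidirectional cross-attention, iterated until a learned convergence
    predictor signals early stopping (or max_iters is reached)."""

    def __init__(self, dim: int = 256, max_iters: int = 3):
        super().__init__()
        self.high_to_low = CrossAttentionBlock(dim)  # low tokens query high tokens
        self.low_to_high = CrossAttentionBlock(dim)  # high tokens query low tokens
        self.converged = nn.Sequential(              # learned convergence predictor
            nn.Linear(dim, 1), nn.Sigmoid()
        )
        self.max_iters = max_iters

    def forward(self, high: torch.Tensor, low: torch.Tensor):
        for step in range(self.max_iters):
            low = self.high_to_low(low, high)    # refine details with global context
            high = self.low_to_high(high, low)   # update semantics from details
            # Mean-pool the high-level state and predict stopping probability.
            p_stop = self.converged(high.mean(dim=1))
            if p_stop.mean() > 0.5:              # assumed threshold; halts early
                break
        return high, low, step + 1


# Usage: batch of 2 clips, 16 high-level tokens, 64 low-level patch tokens.
model = HierarchicalReasoner(dim=256, max_iters=3)
high = torch.randn(2, 16, 256)
low = torch.randn(2, 64, 256)
high, low, iters = model(high, low)
print(f"stopped after {iters} iteration(s)")
```

Under this reading, the reported drop from 3.0 to 2.1 average iterations would correspond to the predictor firing before `max_iters` on most inputs; how the predictor is supervised during training is not specified in the abstract.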
