TTV-HRM: Hierarchical Reasoning Architecture for Efficient Text-to-Video Generation

Abstract

Text-to-video generation is largely inaccessible to most researchers due to its reliance on large-scale models and multi-GPU infrastructure. We introduce the Text-to-Video Hierarchical Reasoning Model (TTV-HRM), a lightweight framework that enables coherent text-conditioned video synthesis on a single commodity GPU. The model employs interleaved hierarchical reasoning in which a high-level transformer captures global semantic structure while a low-level layer refines spatiotemporal details through bidirectional cross-attention. A learned convergence predictor enables early stopping, reducing average inference iterations from 3.0 to 2.1 without quality loss. The 115M-parameter system integrates rotary positional embeddings, SwiGLU feed-forward layers, and a 3D convolutional video autoencoder; it trains in about four hours on a single NVIDIA T4 GPU at roughly $2 of cloud cost, with sub-second inference per clip. On 8-frame 32×32 video generation, TTV-HRM improves frame-wise Fréchet Inception Distance from 120.5 to 62.1 across three epochs using only 45 video–text pairs. Results demonstrate semantic alignment, temporal coherence, and object persistence, showing that hierarchical reasoning can substitute for model scale to make text-to-video research more accessible.
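To make the abstract's two central mechanisms concrete, the PyTorch sketch below illustrates one plausible reading of the interleaved hierarchical loop: a low-level state refined by cross-attending to a high-level state, the high-level state updated by cross-attending back, and a learned convergence predictor that halts iteration early. This is a minimal sketch of the general technique, not the paper's implementation; the module names, dimensions, pooling scheme, and the 0.5 stopping threshold are all assumptions.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Residual cross-attention: queries in `x` attend to `context`."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(self.norm(x), context, context)
        return x + out


class HierarchicalReasoner(nn.Module):
    """Interleaved reasoning loop: a high-level (semantic) state and a
    low-level (spatiotemporal) state exchange information through
    bidirectional cross-attention, iterated until a learned convergence
    predictor signals early stopping (or max_iters is reached)."""

    def __init__(self, dim: int = 256, max_iters: int = 3):
        super().__init__()
        self.high_to_low = CrossAttentionBlock(dim)  # low tokens query high tokens
        self.low_to_high = CrossAttentionBlock(dim)  # high tokens query low tokens
        self.converged = nn.Sequential(              # learned convergence predictor
            nn.Linear(dim, 1), nn.Sigmoid()
        )
        self.max_iters = max_iters

    def forward(self, high: torch.Tensor, low: torch.Tensor):
        for step in range(self.max_iters):
            low = self.high_to_low(low, high)    # refine details with global context
            high = self.low_to_high(high, low)   # update semantics from details
            # Mean-pool the high-level state and predict stopping probability.
            p_stop = self.converged(high.mean(dim=1))
            if p_stop.mean() > 0.5:              # assumed threshold; halts early
                break
        return high, low, step + 1


# Usage: batch of 2 clips, 16 high-level tokens, 64 low-level patch tokens.
model = HierarchicalReasoner(dim=256, max_iters=3)
high = torch.randn(2, 16, 256)
low = torch.randn(2, 64, 256)
high, low, iters = model(high, low)
print(f"stopped after {iters} iteration(s)")
```

Under this reading, the reported drop from 3.0 to 2.1 average iterations would correspond to the predictor firing before `max_iters` on most inputs; how the predictor is supervised during training is not specified in the abstract.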
