LinGen-Uni: A Universal Linear-Complexity Framework for High-Resolution Minute-Length Text-to-Video Generation
Abstract
Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically with the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos only 10-20 seconds long. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly with the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. We further propose a distillation framework that quickly transfers the self-attention layers in pre-trained DiTs to our proposed MATE layers by reusing the self-attention weights to initialize 90% of the weights of the MATE layers. Benefiting from this, our proposed LinGen can be universally deployed on any pre-trained DiT through lightweight distillation. We therefore call LinGen equipped with this distillation framework LinGen-Uni. Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15x and latency by up to 11.5x.
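The quadratic-versus-linear scaling claim can be illustrated with a back-of-the-envelope FLOP count. This is an illustrative sketch, not the paper's implementation: the attention cost models the two n x n x d matmuls of self-attention, and the linear cost models a state-space scan (like the Mamba2 block in the MA-branch) with a hypothetical per-token state size.

```python
# Illustrative FLOP counts: self-attention grows quadratically with the
# number of tokens, a linear-state-space scan grows linearly.
# The state size (64) is an assumed, illustrative value.

def self_attention_flops(n_tokens: int, dim: int) -> int:
    # QK^T and attention-weighted V: two (n x n x d) matrix multiplies
    return 2 * n_tokens * n_tokens * dim

def linear_scan_flops(n_tokens: int, dim: int, state: int = 64) -> int:
    # per-token state update and readout: cost per token is constant
    return 2 * n_tokens * dim * state

dim = 1024
for n in (4_096, 65_536, 1_048_576):  # roughly: one image, short clip, minute-long video
    ratio = self_attention_flops(n, dim) / linear_scan_flops(n, dim)
    print(f"{n:>9} tokens: attention/linear FLOP ratio = {ratio:,.0f}x")
```

The ratio grows with the token count, which is why quadratic attention dominates at minute-length video scales while a linear block keeps the cost proportional to the number of pixels.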
Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% against Gen-3, LumaLabs, and Kling, respectively). This paves the way for hour-length movie generation and real-time interactive video generation. In addition, distillation results indicate that LinGen-Uni maintains the quality of Wan2.1-T2V-1.3B after distillation while achieving up to a 30.7x speedup in inference latency, and significantly outperforms LTX-Video-2B in video quality and text-video alignment. More minute-length video examples can be found at our project website: https://lineargen.github.io/. The complete code of the distillation framework will be released soon after acceptance.
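The distillation framework's initialization can be sketched as follows. This is a hypothetical illustration of the stated idea, reusing pre-trained self-attention weights to warm-start the replacement linear block so that most of its parameters are inherited rather than trained from scratch; the key names (`q_proj`, `in_proj_x`, etc.), the projection shapes, and the mapping of attention projections to linear-block projections are all assumptions, not the paper's actual scheme.

```python
import numpy as np

# Hypothetical sketch: initialize a linear-complexity block from a
# pre-trained self-attention layer by copying its projection matrices.
# Only a small remainder (e.g. state-transition parameters) starts fresh,
# so the bulk of the new block's weights is inherited.

def init_linear_block_from_attention(attn_weights: dict) -> dict:
    return {
        # reuse attention projections for the new block's input/output maps
        "in_proj_x": attn_weights["v_proj"].copy(),
        "in_proj_b": attn_weights["k_proj"].copy(),
        "in_proj_c": attn_weights["q_proj"].copy(),
        "out_proj": attn_weights["o_proj"].copy(),
        # small freshly-initialized remainder (illustrative shape)
        "state_decay": np.zeros(16),
    }

rng = np.random.default_rng(0)
attn = {k: rng.standard_normal((8, 8)) for k in ("q_proj", "k_proj", "v_proj", "o_proj")}
block = init_linear_block_from_attention(attn)

reused = sum(v.size for k, v in block.items() if k != "state_decay")
total = sum(v.size for v in block.values())
print(f"fraction of weights inherited: {reused / total:.2f}")
```

With these toy shapes, most of the new block's parameters are copied from the attention layer, mirroring the abstract's claim that roughly 90% of the MATE weights can be initialized from the pre-trained DiT.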