LinGen-Uni: A Universal Linear-Complexity Framework for High-Resolution Minute-Length Text-to-Video Generation

Abstract

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds in length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. We further propose a distillation framework that quickly transfers the self-attention layers in pre-trained DiTs to our proposed MATE layers by reusing the self-attention weights to initialize 90% of the MATE layers' weights. As a result, LinGen can be universally deployed on any pre-trained DiT through light distillation. Thus, we refer to LinGen equipped with this distillation framework as LinGen-Uni. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15x (11.5x) FLOPs (latency) reduction.
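The quadratic-vs-linear scaling contrast can be illustrated with a back-of-envelope FLOPs count (a minimal sketch, not the paper's actual cost model; the fixed state size of the linear block is an assumed stand-in):

```python
def self_attention_flops(n_tokens, dim):
    # Q·K^T and attention-weighted V each cost ~n^2 * d multiply-adds
    return 2 * n_tokens * n_tokens * dim

def linear_block_flops(n_tokens, dim, state=64):
    # A Mamba-style scan visits each token once with a fixed-size state
    return n_tokens * dim * state

# Doubling spatial resolution quadruples the token count: self-attention
# cost grows 16x, while the linear block's cost grows only 4x.
base, scaled = 4096, 4 * 4096
ratio_attn = self_attention_flops(scaled, 1024) / self_attention_flops(base, 1024)
ratio_lin = linear_block_flops(scaled, 1024) / linear_block_flops(base, 1024)
print(ratio_attn, ratio_lin)  # → 16.0 4.0
```

This gap compounds over a minute-length clip, where the token count is orders of magnitude larger than in a 10-second video.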
Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% against Gen-3, LumaLabs, and Kling, respectively). This paves the way for hour-length movie generation and real-time interactive video generation. In addition, distillation results indicate that LinGen-Uni maintains the quality of Wan2.1-T2V-1.3B after distillation while achieving up to 30.7x speedup in inference latency, significantly outperforming LTX-Video-2B in video quality and text-video alignment. More minute-length video examples can be found at our project website: https://lineargen.github.io/. The complete code of the distillation framework will be released soon after acceptance.
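The weight-reuse idea behind the distillation step can be sketched as copying the pretrained attention projections into the corresponding projections of the replacement linear-complexity layer, so that only the new architecture-specific parameters start from scratch (a hypothetical illustration; the projection names, sizes, and the `state` parameter are stand-ins, and the actual mapping onto MATE's parameters is defined by the paper's framework):

```python
import random

random.seed(0)
dim = 8

def rand_matrix(rows, cols, scale=1.0):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

# Pretrained self-attention projection weights (stand-ins for a DiT layer)
pretrained = {name: rand_matrix(dim, dim) for name in ("q", "k", "v", "out")}

# The replacement linear-complexity layer reuses those projections directly,
# so most of its parameters start from the pretrained values...
linear_layer = {name: [row[:] for row in w] for name, w in pretrained.items()}
# ...and only the new parameters (e.g. per-channel scan/state parameters,
# a stand-in here) are freshly initialized before light distillation.
linear_layer["state"] = [[random.gauss(0, 0.01) for _ in range(dim)]]

def n_params(w):
    return sum(len(row) for row in w)

reused = sum(n_params(linear_layer[name]) for name in pretrained)
total = sum(n_params(w) for w in linear_layer.values())
print(f"{reused / total:.1%} of parameters initialized from pretrained weights")
```

Because the bulk of the replacement layer starts from pretrained weights, the distillation only has to adapt a small fraction of parameters, which is what makes the transfer "light" enough to apply to any pre-trained DiT.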