LinGen-Uni: A Universal Linear-Complexity Framework for High-Resolution Minute-Length Text-to-Video Generation
Abstract
Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically with the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos only 10-20 seconds long. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly with the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. We further propose a distillation framework that quickly transfers the self-attention layers in pre-trained DiTs to our proposed MATE layers by reusing the self-attention weights to initialize 90% of the weights of the MATE layers. Benefiting from this, our proposed LinGen can be universally deployed on any pre-trained DiT through lightweight distillation. We therefore call LinGen equipped with this distillation framework LinGen-Uni. Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15x and latency by up to 11.5x.
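The quadratic-versus-linear scaling claim can be illustrated with a back-of-the-envelope FLOP count. This is an illustrative sketch, not the paper's implementation: the attention cost models the two n x n x d matmuls of self-attention, and the linear cost models a state-space scan (like the Mamba2 block in the MA-branch) with a hypothetical per-token state size.

```python
# Illustrative FLOP counts: self-attention grows quadratically with the
# number of tokens, a linear-state-space scan grows linearly.
# The state size (64) is an assumed, illustrative value.

def self_attention_flops(n_tokens: int, dim: int) -> int:
    # QK^T and attention-weighted V: two (n x n x d) matrix multiplies
    return 2 * n_tokens * n_tokens * dim

def linear_scan_flops(n_tokens: int, dim: int, state: int = 64) -> int:
    # per-token state update and readout: cost per token is constant
    return 2 * n_tokens * dim * state

dim = 1024
for n in (4_096, 65_536, 1_048_576):  # roughly: one image, short clip, minute-long video
    ratio = self_attention_flops(n, dim) / linear_scan_flops(n, dim)
    print(f"{n:>9} tokens: attention/linear FLOP ratio = {ratio:,.0f}x")
```

The ratio grows with the token count, which is why quadratic attention dominates at minute-length video scales while a linear block keeps the cost proportional to the number of pixels.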
Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% against Gen-3, LumaLabs, and Kling, respectively). This paves the way for hour-length movie generation and real-time interactive video generation. In addition, distillation results indicate that LinGen-Uni maintains the quality of Wan2.1-T2V-1.3B after distillation while achieving up to a 30.7x speedup in inference latency, and significantly outperforms LTX-Video-2B in video quality and text-video alignment. More minute-length video examples can be found at our project website: https://lineargen.github.io/. The complete code of the distillation framework will be released soon after acceptance.
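The distillation framework's initialization can be sketched as follows. This is a hypothetical illustration of the stated idea, reusing pre-trained self-attention weights to warm-start the replacement linear block so that most of its parameters are inherited rather than trained from scratch; the key names (`q_proj`, `in_proj_x`, etc.), the projection shapes, and the mapping of attention projections to linear-block projections are all assumptions, not the paper's actual scheme.

```python
import numpy as np

# Hypothetical sketch: initialize a linear-complexity block from a
# pre-trained self-attention layer by copying its projection matrices.
# Only a small remainder (e.g. state-transition parameters) starts fresh,
# so the bulk of the new block's weights is inherited.

def init_linear_block_from_attention(attn_weights: dict) -> dict:
    return {
        # reuse attention projections for the new block's input/output maps
        "in_proj_x": attn_weights["v_proj"].copy(),
        "in_proj_b": attn_weights["k_proj"].copy(),
        "in_proj_c": attn_weights["q_proj"].copy(),
        "out_proj": attn_weights["o_proj"].copy(),
        # small freshly-initialized remainder (illustrative shape)
        "state_decay": np.zeros(16),
    }

rng = np.random.default_rng(0)
attn = {k: rng.standard_normal((8, 8)) for k in ("q_proj", "k_proj", "v_proj", "o_proj")}
block = init_linear_block_from_attention(attn)

reused = sum(v.size for k, v in block.items() if k != "state_decay")
total = sum(v.size for v in block.values())
print(f"fraction of weights inherited: {reused / total:.2f}")
```

With these toy shapes, most of the new block's parameters are copied from the attention layer, mirroring the abstract's claim that roughly 90% of the MATE weights can be initialized from the pre-trained DiT.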