MiniCausal-T2V: Towards Ultra-Low Latency and Memory-Efficient Causal Video Generation on Edge Devices

Abstract

The proliferation of Text-to-Video (T2V) generation technologies has opened new avenues for content creation, yet deploying these models on resource-constrained edge devices remains a significant challenge due to their complexity and high computational demands. This paper introduces MiniCausal-T2V (MCT-Video), an end-to-end optimized causal latent video diffusion model engineered for ultra-low-latency, memory-efficient T2V generation on edge platforms, particularly Qualcomm Hexagon NPUs. MCT-Video combines a set of complementary innovations: a Lightweight Causal Transformer Backbone designed from scratch for intrinsic efficiency and causality, an Adaptive Sparse Temporal Attention mechanism that dynamically reduces temporal computation, Quantization-Aware Fine-tuning for robust low-precision deployment, a Unified Multi-objective Distillation strategy that transfers knowledge holistically, and Extreme Step Flow-Matching Inference for rapid generation. Extensive experiments show that MCT-Video achieves superior video quality on comprehensive VBench metrics and in human perceptual evaluation, while setting new efficiency benchmarks: it attains markedly lower end-to-end inference latency and a smaller memory footprint on Hexagon NPUs than existing edge-optimized solutions. This work is a significant step towards high-quality, real-time T2V generation directly on portable devices.
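To make the "causal" property named in the abstract concrete, the sketch below shows one plausible form of causal temporal attention over per-frame latent tokens, where each frame attends only to itself and earlier frames. This is an illustrative assumption for exposition, not the authors' implementation; the function name, shapes, and identity projections are hypothetical.

```python
# Illustrative sketch (assumed, not from the paper): causal attention along the
# frame axis of a video latent, so frame t never sees frames > t.
import torch
import torch.nn.functional as F

def causal_temporal_attention(x: torch.Tensor, num_heads: int = 4) -> torch.Tensor:
    """x: latent tokens of shape (batch, frames, dim)."""
    b, t, d = x.shape
    head_dim = d // num_heads
    # Real models use learned Q/K/V projections; identity keeps the sketch short.
    q = k = v = x.view(b, t, num_heads, head_dim).transpose(1, 2)   # (b, h, t, hd)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5              # (b, h, t, t)
    # Lower-triangular mask: position i may attend only to positions <= i.
    causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                             # (b, h, t, hd)
    return out.transpose(1, 2).reshape(b, t, d)

# Example: 8 frames of 64-dimensional latent tokens.
frames = torch.randn(1, 8, 64)
print(causal_temporal_attention(frames).shape)  # torch.Size([1, 8, 64])
```

An adaptive sparse variant, as the abstract describes, would additionally prune entries of this mask at runtime (e.g., keeping only a window of recent frames) rather than attending to the full causal history.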
