MiniCausal-T2V: Towards Ultra-Low Latency and Memory-Efficient Causal Video Generation on Edge Devices

Abstract

The proliferation of Text-to-Video (T2V) generation technologies has opened new avenues for content creation, yet deploying these models on resource-constrained edge devices remains a significant challenge due to their complexity and high computational demands. This paper introduces MiniCausal-T2V (MCT-Video), an end-to-end optimized causal latent video diffusion model engineered for ultra-low-latency, memory-efficient T2V generation on edge platforms, particularly Qualcomm Hexagon NPUs. MCT-Video combines a set of complementary innovations: a Lightweight Causal Transformer Backbone designed from scratch for intrinsic efficiency and causality, an Adaptive Sparse Temporal Attention mechanism that dynamically reduces temporal computation, Quantization-Aware Fine-tuning for robust low-precision deployment, a Unified Multi-objective Distillation strategy that transfers knowledge holistically, and Extreme Step Flow-Matching Inference for rapid generation. Extensive experiments show that MCT-Video achieves superior video quality on comprehensive VBench metrics and in human perceptual evaluation, while setting new efficiency benchmarks: it attains markedly lower end-to-end inference latency and a smaller memory footprint on Hexagon NPUs than existing edge-optimized solutions. This work is a significant step towards high-quality, real-time T2V generation directly on portable devices.
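To make the "causal" property named in the abstract concrete, the sketch below shows one plausible form of causal temporal attention over per-frame latent tokens, where each frame attends only to itself and earlier frames. This is an illustrative assumption for exposition, not the authors' implementation; the function name, shapes, and identity projections are hypothetical.

```python
# Illustrative sketch (assumed, not from the paper): causal attention along the
# frame axis of a video latent, so frame t never sees frames > t.
import torch
import torch.nn.functional as F

def causal_temporal_attention(x: torch.Tensor, num_heads: int = 4) -> torch.Tensor:
    """x: latent tokens of shape (batch, frames, dim)."""
    b, t, d = x.shape
    head_dim = d // num_heads
    # Real models use learned Q/K/V projections; identity keeps the sketch short.
    q = k = v = x.view(b, t, num_heads, head_dim).transpose(1, 2)   # (b, h, t, hd)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5              # (b, h, t, t)
    # Lower-triangular mask: position i may attend only to positions <= i.
    causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                             # (b, h, t, hd)
    return out.transpose(1, 2).reshape(b, t, d)

# Example: 8 frames of 64-dimensional latent tokens.
frames = torch.randn(1, 8, 64)
print(causal_temporal_attention(frames).shape)  # torch.Size([1, 8, 64])
```

An adaptive sparse variant, as the abstract describes, would additionally prune entries of this mask at runtime (e.g., keeping only a window of recent frames) rather than attending to the full causal history.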
