Real-Time Streaming Text-to-Video Editing with a Diffusion Transformer

Abstract

The current Text-to-Video (T2V) generation paradigm struggles with real-time, interactive applications because its models are designed for offline, fixed-length video synthesis. This limitation makes it difficult to maintain long-term temporal consistency and to achieve the low latency that interactive content creation demands. We introduce StreamEdit-DiT, a novel framework for real-time streaming text-to-video editing. Our approach extensively modifies the Diffusion Transformer (DiT) architecture, incorporating a Multi-Scale Adaptive DiT enhanced with a Progressive Temporal Consistency Module (PTCM) and Dynamic Sparse Attention (DSA) to improve temporal coherence and computational efficiency. The training methodology combines Streaming Coherence Matching (SCM) with an Adaptive Sliding Window (ASW) buffer, complemented by a Hierarchical Progressive Distillation strategy for efficient inference. Evaluated on a custom benchmark, StreamEdit-DiT significantly outperforms existing streaming and consistency methods in prompt adherence, edit fidelity, and overall quality. Crucially, our distilled model achieves high resolution, real-time frame rates, and very low latency on a single H100 GPU, validating its practical applicability for interactive video editing.
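
To make the streaming mechanism concrete, the sketch below illustrates how an ASW-style frame buffer and a DSA-style sparse temporal attention mask could work. This is a minimal, hypothetical PyTorch sketch: the class and function names, the default window sizes, and the implementation details are our own assumptions for illustration and are not taken from the paper.

import torch

class AdaptiveSlidingWindowBuffer:
    # Hypothetical sketch of an ASW-style buffer: retains the latents of
    # the most recent `window` frames so each incoming frame is denoised
    # with temporal context instead of reprocessing the whole stream.
    def __init__(self, window: int = 16):
        self.window = window
        self.frames = []  # each entry: a (C, H, W) latent tensor

    def push(self, latent: torch.Tensor) -> None:
        self.frames.append(latent)
        if len(self.frames) > self.window:
            self.frames.pop(0)  # evict the oldest frame latent

    def context(self) -> torch.Tensor:
        # (T, C, H, W) stack of buffered latents, used as conditioning
        return torch.stack(self.frames, dim=0)

def local_temporal_mask(num_frames: int, radius: int = 4) -> torch.Tensor:
    # Hypothetical DSA-style sparsity pattern: each frame attends only to
    # frames within `radius` steps, yielding a banded (sparse) attention
    # mask rather than dense T x T temporal attention.
    idx = torch.arange(num_frames)
    return (idx[:, None] - idx[None, :]).abs() <= radius  # (T, T) bool

In a sketch of this kind, the banded mask keeps per-frame attention cost proportional to the window size rather than to the full stream length, which is the type of coherence-versus-compute trade-off the abstract attributes to the DSA and ASW components.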
