StreamDiffusionV2: An Open-Source Streaming System for Real-Time Interactive Video Generation
† Project lead, corresponding to xuchenfeng@berkeley.edu
* This work was done when Tianrui Feng visited UC Berkeley, advised by Chenfeng Xu.
Abstract
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Our first release, StreamDiffusion, has powered efficient and creative live-streaming products (e.g., Daydream, TouchDesigner) but hit limits on temporal consistency due to its image-based design. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline systems primarily optimize throughput via large-batch processing. In contrast, live online streaming faces strict real-time constraints with small batches: every frame must meet a per-frame deadline with low jitter. Moreover, the online system must dynamically adapt its workload (e.g., steps, resolution, effects) to track user query rates and sustain target throughput under varying load.
We bridge this gap with StreamDiffusionV2: a fully open-source, real-time pipeline for interactive live streams with video diffusion models. The system introduces Stream-VAE, a synced rolling KV cache, dynamic pipeline parallelism, and a motion-aware noise controller to remain stable under dynamic inputs. To serve users across hardware tiers, StreamDiffusionV2 scales seamlessly across diverse GPU environments and supports flexible denoising steps (e.g., 1–4 steps). Without TensorRT or quantization, StreamDiffusionV2 reaches 42 FPS on 4× H100 with a 4-step model, and 16.6 (resp. 8.6) FPS on 2× (resp. single) RTX 4090 with a 1-step model—making state-of-the-art generative live streaming practical and accessible, from individual creators to enterprise platforms.
Online Streaming Video2Video Transformation
StreamDiffusionV2 robustly supports fast-motion video transfer.
Left: CausVid adapted for streaming. Right: StreamDiffusionV2. Our method maintains style and temporal consistency far better. All demos run on a remote server; the slight stuttering in the videos is due to network transmission delays (50-300 ms).
StreamDiffusionV2 robustly supports diverse and complex prompts.
Animal-Centric Video Transfer: let your pet begin a Live Stream!
StreamDiffusionV2 is ready for your pet's real-time Live Stream: the video is captured directly from a camera and processed in real time for pet YouTubers!
Human-Centric Video Transfer: let's begin your Live Stream!
StreamDiffusionV2 is ready for your real-time Live Stream: the video is captured directly from a camera and processed in real time for YouTubers!
Methods
Live-streaming pipeline

Fig. 1: Overview of the StreamDiffusionV2 pipeline.
StreamDiffusionV2 combines system- and algorithm-level efforts to achieve live streaming with video diffusion models. It includes: a dynamic scheduler for pipeline parallelism with stream batch, a Stream-VAE with a rolling KV cache, and a motion-aware noise controller.
Pipeline Parallelism with Stream Batch

Fig. 2: Detailed illustration of our pipeline-parallelism-with-stream-batch architecture.
StreamDiffusionV2 aims to serve users at different scales, from individual creators with one or two GPUs to enterprise platforms with many more. To offer flexible deployment options, we adopt pipeline parallelism, which divides the model's blocks across multiple GPUs. Without adaptation, pipeline parallelism alone does not improve efficiency because video models are compute-bound. Partitioning, however, shifts each GPU's workload toward being memory-bound, which motivates batching the pipeline to recover throughput. We therefore propose pipeline parallelism with stream batch, as shown in Fig. 2. This approach significantly improves throughput, as shown in Fig. 3 and Fig. 4.
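To make the idea concrete, here is a minimal sketch of how a stream batch (a micro-batch of latent chunks, each at a different denoising step) flows through pipeline stages so that every stage stays busy once the pipeline is filled. The stand-in blocks, the names `run_stage` and `pipeline_tick`, and the single-device simulation are illustrative assumptions, not the actual StreamDiffusionV2 implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: toy blocks on one device stand in for DiT blocks
# split across GPUs; shapes and names are hypothetical.
dim, num_blocks, num_stages, num_steps = 64, 8, 4, 4

blocks = nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                       for _ in range(num_blocks))
per_stage = num_blocks // num_stages
# Static pipeline partition: each "GPU" owns a contiguous slice of blocks.
stages = [blocks[s * per_stage:(s + 1) * per_stage] for s in range(num_stages)]

# Stream batch: one micro-batch holds num_steps latent chunks, each at a
# different denoising step, so a single forward pass advances all of them.
stream_batch = torch.randn(num_steps, dim)

def run_stage(s: int, x: torch.Tensor) -> torch.Tensor:
    for blk in stages[s]:
        x = blk(x)
    return x

def pipeline_tick(in_flight, next_batch, outputs):
    """One tick: stage s processes micro-batch k while stage s+1 processes
    micro-batch k-1, so all stages (GPUs) work concurrently."""
    for s in reversed(range(num_stages)):   # back-to-front so hand-offs don't collide
        if in_flight[s] is None:
            continue
        y = run_stage(s, in_flight[s])
        in_flight[s] = None
        if s + 1 < num_stages:
            in_flight[s + 1] = y            # hand off to the next GPU
        else:
            outputs.append(y)               # finished chunk -> decode and stream out
    in_flight[0] = next_batch               # a fresh stream batch enters stage 0

state, outs = [None] * num_stages, []
for _ in range(num_stages + 2):             # fill the pipeline, then steady state
    # In the real system, the input would be the newest stream batch assembled
    # from incoming frames; here we just reuse a dummy batch.
    pipeline_tick(state, stream_batch.clone(), outs)
```

In the real system each stage would sit on its own GPU and the hand-offs would be device-to-device transfers; the point of the sketch is only that partitioning plus stream batching keeps every stage saturated.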
Beyond this static partition, we find that VAE encoding and decoding distribute work unevenly across GPUs. To improve throughput further, we introduce a scheduler that dynamically reallocates blocks across devices based on inference-time measurements. Together, these methods deliver competitive real-time generation on commodity GPUs, lowering the barrier to practical deployment.
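One simple way to picture the re-partitioning step is as a balanced contiguous split of blocks by measured per-block latency, with fixed per-stage overheads for VAE encode/decode. The sketch below, including the names `balanced_partition`, `block_ms`, and `extra_ms` and the numbers in the example, is a hypothetical illustration of this idea rather than the scheduler's actual policy.

```python
def balanced_partition(block_ms, num_stages, extra_ms=None):
    """Split blocks into contiguous stages so the heaviest stage is as light as
    possible. block_ms[i] is the measured latency of block i (ms); extra_ms[s]
    is fixed overhead on stage s, e.g. VAE encode on stage 0, decode on the last."""
    extra_ms = extra_ms or [0.0] * num_stages

    def feasible(budget):
        # Greedily pack blocks left to right under the per-stage budget.
        stage, load, cuts = 0, extra_ms[0], []
        for i, t in enumerate(block_ms):
            if load + t > budget:
                cuts.append(i)                  # block i starts a new stage
                stage += 1
                if stage == num_stages or extra_ms[stage] + t > budget:
                    return None
                load = extra_ms[stage]
            load += t
        return cuts

    lo, hi = max(block_ms), sum(block_ms) + sum(extra_ms)
    best = None
    while hi - lo > 1e-3:                       # binary-search the per-stage budget
        mid = (lo + hi) / 2
        cuts = feasible(mid)
        if cuts is not None:
            best, hi = cuts, mid
        else:
            lo = mid
    return best                                 # block indices where new stages begin

# Hypothetical example: 8 blocks, 4 GPUs, VAE encode on GPU 0 and decode on GPU 3.
print(balanced_partition(
    block_ms=[3.0, 3.1, 2.9, 3.2, 3.0, 3.1, 3.0, 2.9],
    num_stages=4,
    extra_ms=[4.0, 0.0, 0.0, 5.5],
))
```

In principle, re-running such a partition from fresh timing measurements at inference time lets a scheduler track load changes (e.g., different resolutions or step counts) without restarting the stream.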

Fig. 3: FPS results of our system on H100 (with NVLink) and RTX 4090 GPUs (without NVLink), evaluated at a resolution of 832 × 480.

Fig. 4: FPS results of our system on H100 (with NVLink) and RTX 4090 GPUs (without NVLink), evaluated at a resolution of 512 × 512. This resolution is used in all Web UI demos.
Stream-VAE
Stream-VAE is a low-latency video VAE implementation for real-time video generation. Unlike current approaches, which process long video sequences and therefore introduce significant latency, Stream-VAE processes one small video chunk at a time; specifically, four video frames are compressed into a single latent frame. In addition, every 3D convolution module of the VAE caches features from the previous chunk, which preserves temporal consistency across chunks while supporting efficient live-streaming generation.
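The caching pattern can be sketched as a causal 3D convolution that keeps the last few input frames of the previous chunk and prepends them to the next one. The module below (`CachedCausalConv3d`, its shapes, and the first-chunk padding choice) is an illustrative assumption and omits the temporal downsampling of the real Stream-VAE.

```python
import torch
import torch.nn as nn

class CachedCausalConv3d(nn.Module):
    """Toy causal 3D conv with a temporal feature cache (illustrative only)."""
    def __init__(self, in_ch: int, out_ch: int, k=(3, 3, 3)):
        super().__init__()
        self.kt = k[0]
        # Pad spatially inside the conv; pad temporally with the cached frames.
        self.conv = nn.Conv3d(in_ch, out_ch, k, padding=(0, k[1] // 2, k[2] // 2))
        self.cache = None            # last (kt - 1) input frames of the previous chunk

    def forward(self, x):            # x: (B, C, T_chunk, H, W)
        if self.cache is None:
            # First chunk: causal padding by repeating the first frame.
            pad = x[:, :, :1].repeat(1, 1, self.kt - 1, 1, 1)
        else:
            pad = self.cache
        x_cat = torch.cat([pad, x], dim=2)
        self.cache = x_cat[:, :, -(self.kt - 1):].detach()   # tail for the next chunk
        return self.conv(x_cat)      # output has the same T_chunk as the input

# Usage: feed the stream chunk by chunk; each chunk sees the cached tail of the
# previous one, so no temporal seam appears at chunk boundaries.
layer = CachedCausalConv3d(3, 8)
for chunk in torch.randn(5, 1, 3, 4, 32, 32):   # 5 chunks of 4 frames each
    out = layer(chunk)                           # (1, 8, 4, 32, 32)
```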
Rolling KV Cache and Sink Token
We integrate the causal DiT with Stream-VAE to enable live-streaming video generation. Our rolling KV cache differs substantially from prior designs in several aspects: (1) instead of maintaining a long KV cache, we use a much shorter cache and introduce sink tokens to preserve the generation style during rolling updates; (2) when the current frame's timestamp (position index) surpasses a set threshold, we reset it to prevent the visual-quality degradation caused by overly large RoPE positions or by position indices exceeding the encoding limit. These mechanisms collectively enable our pipeline to achieve infinite-length video-to-video live-streaming generation while maintaining stable quality and consistent style.
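Below is a rough sketch of what such a cache can look like: sink tokens stay at the front permanently, older tokens roll out once the cache is full, and positions are reset once they grow past a threshold. The class name `RollingKVCache`, the shapes, and in particular the position-reset rule are assumptions for illustration; the actual reset policy is only described at a high level above.

```python
import torch

class RollingKVCache:
    """Toy per-layer rolling KV cache with sink tokens (illustrative only)."""
    def __init__(self, num_sink: int, max_len: int, max_pos: int):
        self.num_sink = num_sink   # tokens kept forever to anchor the style
        self.max_len = max_len     # total cache length (sink + rolling window)
        self.max_pos = max_pos     # reset RoPE positions beyond this threshold
        self.k = self.v = None
        self.pos = 0               # RoPE position assigned to the next tokens

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (B, heads, T_new, head_dim) for the newest frame(s)
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        if self.k.shape[2] > self.max_len:
            # Evict the oldest non-sink tokens; keep the sink tokens in front.
            keep = self.max_len - self.num_sink
            self.k = torch.cat([self.k[:, :, :self.num_sink], self.k[:, :, -keep:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.num_sink], self.v[:, :, -keep:]], dim=2)
        self.pos += k_new.shape[2]
        if self.pos > self.max_pos:
            # Placeholder reset to avoid degenerate RoPE extrapolation.
            self.pos = self.max_len

    def kv(self):
        return self.k, self.v

# Usage per DiT layer: append the newest frame's keys/values, then attend
# against cache.kv(); the cache never grows, so the stream can run indefinitely.
cache = RollingKVCache(num_sink=16, max_len=256, max_pos=4096)
cache.append(torch.randn(1, 8, 32, 64), torch.randn(1, 8, 32, 64))
```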
Motion-aware Noise Controller
High-speed motion occurs frequently in live-streaming applications, yet current video diffusion models struggle with it. To address this, we propose the motion-aware noise controller, a training-free method that adapts the noise rate to the motion frequency of the input frames. Specifically, we estimate motion frequency as the mean squared error (MSE) between successive frames and map it linearly to a noise rate using pre-determined statistical parameters. This balances visual quality and motion continuity in video-to-video live streaming.
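The mapping itself is simple enough to sketch in a few lines. The thresholds `mse_lo`/`mse_hi`, the noise-rate range, and the direction of the mapping (more motion, higher noise rate) are hypothetical placeholders; in practice they would come from the pre-determined statistics mentioned above.

```python
import torch

def motion_aware_noise_rate(prev_frame: torch.Tensor, cur_frame: torch.Tensor,
                            mse_lo: float = 1e-3, mse_hi: float = 5e-2,
                            rate_lo: float = 0.4, rate_hi: float = 0.8) -> torch.Tensor:
    """Map inter-frame MSE (a cheap motion-frequency proxy) linearly onto a
    noise rate, clamped to [rate_lo, rate_hi]. All constants are illustrative."""
    mse = torch.mean((cur_frame - prev_frame) ** 2)
    t = torch.clamp((mse - mse_lo) / (mse_hi - mse_lo), 0.0, 1.0)
    return rate_lo + t * (rate_hi - rate_lo)

# Usage with two consecutive frames normalized to [0, 1]:
prev, cur = torch.rand(3, 480, 832), torch.rand(3, 480, 832)
noise_rate = motion_aware_noise_rate(prev, cur)   # scalar tensor in [0.4, 0.8]
```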
Acknowledgements
StreamDiffusionV2 is inspired by the prior works StreamDiffusion and StreamV2V. Our Causal DiT builds upon CausVid, and the rolling KV cache design is inspired by Self-Forcing.
We are grateful to the team members of StreamDiffusion for their support.
BibTeX
@article{streamdiffusionv2,
title={StreamDiffusionV2: An Open-Sourced Interactive Diffusion Pipeline for Streaming Applications},
author={Tianrui Feng and Zhi Li and Haocheng Xi and Muyang Li and Shuo Yang and Xiuyu Li and Lvmin Zhang and Kelly Peng and Song Han and Maneesh Agrawala and Kurt Keutzer and Akio Kodaira and Chenfeng Xu},
journal={Project Page},
year={2025},
url={https://streamdiffusionv2.github.io/}
}