Introduction to MotionStream
Traditional diffusion models often take several minutes to generate a short video and typically process all frames in parallel. This approach forces users to wait for the full render to complete and prevents them from viewing intermediate results. MotionStream was developed to address this limitation.
According to the paper MotionStream: Real-Time Video Generation with Interactive Motion Controls, published by Adobe researchers on arXiv, the model employs a two-stage teacher-student architecture: a bidirectional teacher model first learns to generate high-quality motion-controlled videos, and this teacher is then distilled into a causal student model capable of streaming video generation in real time.
Thanks to this design, MotionStream achieves sub-second latency and up to ~29 FPS, allowing users to draw motion paths, adjust camera angles, or control object movements and instantly preview the results.
Base Model: Wan 2.1 DiT
The paper notes that the teacher model is built upon Wan DiT, a family of video diffusion transformer models by Alibaba.
Wan 2.1 is available in two primary sizes, 1.3B and 14B parameters (this guide uses the 1.3B variant); a newer release, Wan 2.2, adds a 5B model.
Although MotionStream's source code has not yet been released (its GitHub repo states that Adobe is reviewing it internally), you can still experiment with its base model, Wan 2.1, which is available on Hugging Face. Below is a step-by-step installation and usage guide.
Installation Guide for Wan 2.1
1. Environment Setup
Requirements:
- Python ≥ 3.8
- Torch ≥ 2.4
- GPU with at least 8 GB VRAM (for 480p resolution)
Install the required packages:
pip install --upgrade pip
pip install torch --extra-index-url https://download.pytorch.org/whl/cu118
pip install "huggingface_hub[cli]" diffusers
2. Download Source Code and Dependencies
Hugging Face provides the Wan2.1 repository with sample scripts. Clone and install dependencies:
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
pip install -r requirements.txt
Ensure you are using torch ≥ 2.4.
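A quick way to confirm the installed version and GPU visibility before continuing:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"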
3. Download Model Weights
Use the Hugging Face CLI to fetch the Wan 2.1 (T2V 1.3B) weights locally. Note that generate.py expects the original checkpoint layout from Wan-AI/Wan2.1-T2V-1.3B, while the Diffusers-format repository Wan-AI/Wan2.1-T2V-1.3B-Diffusers is downloaded automatically by the Diffusers example in the next section:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
Note: The 1.3B version targets 480p; it can also generate 720p output, but results are most stable at 480p.
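If you prefer to script the download instead of using the CLI, the huggingface_hub Python API offers snapshot_download; a minimal sketch targeting the same repository and folder:
from huggingface_hub import snapshot_download

# Download the Wan 2.1 T2V 1.3B checkpoint into the folder expected by generate.py
snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",
    local_dir="./Wan2.1-T2V-1.3B",
)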
Usage Guide for Wan 2.1
1. Generate Video with the generate.py Script
The repository includes a generate.py script for text-to-video generation.
Example: Generate a 5-second video with the T2V 1.3B model:
python generate.py \
--task t2v-1.3B \
--size 832*480 \
--ckpt_dir ./Wan2.1-T2V-1.3B \
--sample_shift 8 \
--sample_guide_scale 6 \
--prompt "A cat wearing boxing gloves standing on a lit stage"
If you encounter an Out-Of-Memory (OOM) error, add:
--offload_model True --t5_cpu
to reduce VRAM usage.
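Putting it together, a memory-friendly version of the same command looks like this:
python generate.py \
--task t2v-1.3B \
--size 832*480 \
--ckpt_dir ./Wan2.1-T2V-1.3B \
--offload_model True \
--t5_cpu \
--sample_shift 8 \
--sample_guide_scale 6 \
--prompt "A cat wearing boxing gloves standing on a lit stage"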
2. Load Model with Diffusers Library
Alternatively, you can load Wan 2.1 directly via the Diffusers library:
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

# The Wan VAE is kept in float32 for numerical stability; the rest of the pipeline runs in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A realistic cat walking on grass"
negative_prompt = "Multiple people, watermark, noise, low quality"

# 81 frames at 15 fps is roughly a 5-second clip at 832x480.
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=15)
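If the pipeline does not fit in GPU memory, Diffusers' built-in offloading can be used instead of moving the whole pipeline to the GPU. As a sketch, replace pipe.to("cuda") above with the call below (this relies on the accelerate package being installed):
# Keep submodules on the CPU and move each one to the GPU only while it runs.
# Slower per step, but peak VRAM usage drops substantially.
pipe.enable_model_cpu_offload()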
3. Model Variants Summary
| Task | Supported Resolution | Model | Notes |
|---|---|---|---|
| Text-to-Video | 480p | Wan2.1-T2V-1.3B | Requires ~8 GB VRAM |
| Text-to-Video | 480p / 720p | Wan2.1-T2V-14B | Larger model; multi-GPU or a high-VRAM GPU recommended |
| Image-to-Video | 480p / 720p | Wan2.1-I2V-14B | Uses an image as input |
How MotionStream Works
While MotionStream’s codebase has not been officially released, the paper explains its operational design in detail. Below is a summary of its core components:
1. Teacher – Bidirectional Video Generator
Built upon Wan DiT, the teacher model features a light track head that encodes motion trajectories using sinusoidal embeddings, which are injected into both video latents and text embeddings.
It is trained using a Rectified Flow Matching objective and supports conditioning on both text and motion.
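The paper does not ship reference code, but the idea of turning trajectory coordinates into sinusoidal features can be sketched in a few lines. The embedding dimension, frequency schedule, and toy tensor shapes below are illustrative assumptions, not values taken from the paper:
import torch

def sinusoidal_embed(coords: torch.Tensor, dim: int = 128) -> torch.Tensor:
    # Map scalar coordinates to sin/cos features (illustrative, not the paper's exact track head).
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32) * (torch.log(torch.tensor(10000.0)) / half))
    angles = coords.unsqueeze(-1) * freqs                             # (..., half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., dim)

# A toy trajectory: (num_frames, num_tracks, 2) with normalized x/y positions
tracks = torch.rand(81, 4, 2)
track_features = torch.cat(
    [sinusoidal_embed(tracks[..., 0]), sinusoidal_embed(tracks[..., 1])], dim=-1
)
print(track_features.shape)  # torch.Size([81, 4, 256]), ready to be consumed by a track head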
2. Student – Causal Streaming Video Generator
To enable real-time performance, the team distilled the teacher into a causal student model using Self-Forcing and Distribution Matching Distillation methods.
The student employs sliding-window causal attention and introduces an attention sink—an anchor frame that retains context from the first frame.
This design allows the model to generate infinitely long videos without drifting.
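The attention pattern can be pictured with a small boolean mask in which each frame attends to a recent window of frames plus the first frame as the sink. The window size, frame count, and per-frame granularity here are simplifications for illustration, not the paper's actual implementation:
import torch

def sink_window_mask(num_frames: int, window: int) -> torch.Tensor:
    # True where a query frame may attend to a key frame: causal, restricted to the
    # last `window` frames, plus frame 0 kept visible as the attention sink.
    q = torch.arange(num_frames).unsqueeze(1)  # query index
    k = torch.arange(num_frames).unsqueeze(0)  # key index
    causal = k <= q
    in_window = (q - k) < window
    sink = k == 0
    return causal & (in_window | sink)

print(sink_window_mask(num_frames=8, window=3).int())
# Each row keeps the most recent 3 frames plus column 0, so long rollouts stay
# anchored to the first frame instead of drifting.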
3. Interactive Motion Control
Users can draw 2D motion paths, drag objects, or adjust camera angles in real time.
The trajectory data is encoded via sinusoidal embeddings and combined with the text prompt, balanced by guidance weights: w_t = 3.0 (text) and w_m = 1.5 (motion).
This ensures natural realism while maintaining accurate motion control.
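The paper reports these weights, but the exact combination rule is not spelled out above; a standard additive multi-condition classifier-free guidance form, assumed here purely for illustration, would be:
import torch

w_t, w_m = 3.0, 1.5  # text and motion guidance weights quoted above

def guided_prediction(eps_uncond, eps_text, eps_motion):
    # Combine unconditional, text-conditioned, and motion-conditioned predictions.
    # This additive CFG rule is a common choice, assumed here rather than taken from the paper.
    return (eps_uncond
            + w_t * (eps_text - eps_uncond)
            + w_m * (eps_motion - eps_uncond))

# Toy tensors standing in for three model passes on the same latent
eps_u, eps_t, eps_m = torch.randn(3, 4, 8, 8).unbind(0)
print(guided_prediction(eps_u, eps_t, eps_m).shape)  # torch.Size([4, 8, 8])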
4. Real-Time and Infinite Extension
Thanks to the causal student and attention sink, the model achieves approximately:
- 17 FPS at 480p,
- 10 FPS at 720p,
- up to 29 FPS when paired with a Tiny VAE decoder.
Conclusion
MotionStream represents a major step toward interactive video generation, enabling real-time motion control and immediate feedback.
Although the official code is not yet available, the paper demonstrates how the teacher-student architecture, built on Wan 2.1 together with innovations such as sliding-window causal attention and the attention sink, delivers remarkable improvements in speed and quality.
While awaiting the official release, you can experiment with the Wan 2.1-T2V-1.3B model on Hugging Face using the setup and examples provided above.
