
Installation and Usage Guide: MotionStream – Real-Time Video Generation with Motion Control

Introduction to MotionStream

Traditional diffusion models often take several minutes to generate a short video and typically process all frames in parallel. This approach forces users to wait for rendering completion and prevents viewing intermediate results. MotionStream was developed to solve this limitation.

According to the paper MotionStream: Real-Time Video Generation with Interactive Motion Controls, published by Adobe researchers on arXiv, the model uses a two-stage teacher-student architecture. A bidirectional teacher model first learns to generate high-quality motion-controlled videos; this teacher is then distilled into a causal student model capable of streaming video generation in real time.

Thanks to this design, MotionStream achieves sub-second latency and up to ~29 FPS, allowing users to draw motion paths, adjust camera angles, or control object movements and instantly preview the results.

Base Model: Wan 2.1 DiT

The paper notes that the teacher model is built upon Wan DiT, a family of video diffusion transformer models by Alibaba.
Two generations are currently available: Wan 2.1 (shipped in 1.3B and 14B parameter variants) and the newer Wan 2.2 (which adds a 5B model).

Although MotionStream’s source code has not yet been released (its GitHub repo states that Adobe is reviewing it internally), you can still experiment with the teacher model using Wan 2.1 on Hugging Face. Below is a step-by-step installation and usage guide.


Installation Guide for Wan 2.1

1. Environment Setup

Requirements:

  • Python ≥ 3.8
  • Torch ≥ 2.4
  • GPU with at least 8 GB VRAM (for 480p resolution)

Install the required packages:

pip install --upgrade pip
pip install torch --extra-index-url https://download.pytorch.org/whl/cu118
pip install "huggingface_hub[cli]" diffusers

2. Download Source Code and Dependencies

The official Wan2.1 GitHub repository ships sample scripts alongside the model. Clone it and install its dependencies:

git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
pip install -r requirements.txt

Ensure you are using torch ≥ 2.4.
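
Before downloading any weights, it can be worth confirming the environment from a Python shell; the check below is generic PyTorch, not part of the Wan2.1 repo:

import torch

# Confirm the PyTorch version and GPU visibility required by the steps below.
print("torch version:", torch.__version__)           # should report >= 2.4
print("CUDA available:", torch.cuda.is_available())  # should be True for GPU inference
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")  # >= 8 GB for 480p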


3. Download Model Weights

Use the Hugging Face CLI to fetch the Wan 2.1 (T2V 1.3B) weights locally. Download the original checkpoint layout for the repository's generate.py script, and the Diffusers layout if you also want to load the model through the Diffusers library (see the usage section below):

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B-Diffusers --local-dir ./Wan2.1-T2V-1.3B-Diffusers

Note: The 1.3B version supports 480p resolution and can generate 720p, though stability is best at 480p.
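
If you prefer to stay inside Python, the huggingface_hub API can perform the same download as the CLI commands above; the snippet below simply mirrors them:

from huggingface_hub import snapshot_download

# Fetch both checkpoint layouts: the original one for generate.py and the
# Diffusers layout for the WanPipeline example later in this guide.
snapshot_download("Wan-AI/Wan2.1-T2V-1.3B", local_dir="./Wan2.1-T2V-1.3B")
snapshot_download("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", local_dir="./Wan2.1-T2V-1.3B-Diffusers")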


Usage Guide for Wan 2.1

1. Generate a Video with the generate.py Script

The repository includes a generate.py script for text-to-video generation.
Example: Generate a 5-second video with the T2V 1.3B model:

python generate.py \
  --task t2v-1.3B \
  --size 832*480 \
  --ckpt_dir ./Wan2.1-T2V-1.3B \
  --sample_shift 8 \
  --sample_guide_scale 6 \
  --prompt "A cat wearing boxing gloves standing on a lit stage"

If you encounter an Out-Of-Memory (OOM) error, add:

--offload_model True --t5_cpu

to reduce VRAM usage.


2. Load Model with Diffusers Library

Alternatively, you can load Wan 2.1 directly via the Diffusers library:

import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

# Load the VAE in float32 for decoding quality; the rest of the pipeline runs in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A realistic cat walking on grass"
negative_prompt = "Multiple people, watermark, noise, low quality"

# 81 frames at 832x480; exported at 15 fps this yields a clip of roughly 5 seconds.
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0
).frames[0]

export_to_video(output, "output.mp4", fps=15)
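
If the pipeline does not fit into VRAM, you can use Diffusers' standard offloading hook instead of moving the whole pipeline to the GPU; this is a generic Diffusers feature rather than anything Wan-specific:

# Use instead of pipe.to("cuda"): submodules are moved to the GPU only while
# they run, trading generation speed for a much smaller VRAM footprint.
pipe.enable_model_cpu_offload()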

3. Model Variants Summary

Task | Supported Resolution | Model | Notes
Text-to-Video | 480p | Wan2.1-T2V-1.3B | Requires ~8 GB VRAM
Text-to-Video | 480p / 720p | Wan2.1-T2V-14B | Larger model, multi-GPU required
Image-to-Video | 480p / 720p | Wan2.1-I2V-14B | Uses an image as input
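
For the image-to-video variant, Diffusers provides a WanImageToVideoPipeline class. The sketch below assumes the Wan-AI/Wan2.1-I2V-14B-480P-Diffusers repository (check the Wan-AI organization on Hugging Face for the exact checkpoint name) and an input.jpg placed next to the script:

import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Assumed repo ID following the Wan-AI naming pattern; verify before downloading.
model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"

pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # the 14B model is large; offloading helps on smaller GPUs

image = load_image("input.jpg")  # the conditioning image the video starts from
output = pipe(
    image=image,
    prompt="The subject slowly turns toward the camera",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "i2v_output.mp4", fps=15)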

How MotionStream Works

While MotionStream’s codebase has not been officially released, the paper explains its operational design in detail. Below is a summary of its core components:

1. Teacher – Bidirectional Video Generator

Built upon Wan DiT, the teacher model features a lightweight track head that encodes motion trajectories with sinusoidal embeddings; these trajectory embeddings are injected into both the video latents and the text embeddings.
It is trained using a Rectified Flow Matching objective and supports conditioning on both text and motion.
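
MotionStream's code is not public, so the exact track head is unknown, but sinusoidal encoding of coordinates is a standard technique. A minimal sketch, with the embedding dimension and frequency scheme chosen arbitrarily:

import torch

def sinusoidal_embed(coords: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Encode scalar coordinates with sin/cos features at multiple frequencies.

    coords: tensor of normalized positions in [0, 1]; returns (..., dim) embeddings
    using the usual transformer frequency scheme.
    """
    half = dim // 2
    freqs = torch.exp(torch.arange(half, dtype=torch.float32) * (-torch.log(torch.tensor(10000.0)) / half))
    angles = coords.unsqueeze(-1) * freqs  # broadcast positions against frequencies
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# A toy trajectory: 16 frames of (x, y) points drawn by the user, normalized to [0, 1].
track = torch.rand(16, 2)
track_emb = torch.cat([sinusoidal_embed(track[:, 0]), sinusoidal_embed(track[:, 1])], dim=-1)
print(track_emb.shape)  # torch.Size([16, 256]) -- one embedding per frame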


2. Student – Causal Streaming Video Generator

To enable real-time performance, the team distilled the teacher into a causal student model using Self-Forcing and Distribution Matching Distillation methods.
The student employs sliding-window causal attention and introduces an attention sink: the first frame is kept in the attention context as an anchor so that long generations retain global context.
This design allows the model to generate arbitrarily long videos without drifting.
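
The paper's window size and sink layout are not reproduced here, so the mask below only illustrates the mechanism: each frame attends to itself, a short window of preceding frames, and the first (sink) frame.

import torch

def streaming_attention_mask(num_frames: int, window: int = 4, sink_frames: int = 1) -> torch.Tensor:
    """Frame-level boolean attention mask (True = query frame may attend to key frame)."""
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for q in range(num_frames):
        mask[q, max(0, q - window + 1):q + 1] = True  # local causal window
        mask[q, :sink_frames] = True                  # attention sink: always see the anchor frame
    return mask

print(streaming_attention_mask(8, window=3).int())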


3. Interactive Motion Control

Users can draw 2D motion paths, drag objects, or adjust camera angles in real time.
The trajectory data is encoded via sinusoidal embeddings and combined with the text prompt, balanced by guidance weights:
w_t = 3.0 (text), w_m = 1.5 (motion).
This ensures natural realism while maintaining accurate motion control.
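
The paper's exact guidance formula is not given here, but a common way to combine two conditioning signals with separate weights is the nested classifier-free-guidance form sketched below; treat the formula itself as an assumption:

import torch

def dual_guidance(pred_uncond, pred_text, pred_text_motion, w_t: float = 3.0, w_m: float = 1.5):
    """Blend unconditional, text-conditioned and text+motion-conditioned predictions.

    All inputs are model outputs of identical shape; w_t scales the text term
    and w_m scales the additional motion term, matching the weights above.
    """
    return (pred_uncond
            + w_t * (pred_text - pred_uncond)
            + w_m * (pred_text_motion - pred_text))

# Toy tensors standing in for the model's latent predictions.
shape = (1, 16, 60, 104)
out = dual_guidance(torch.randn(shape), torch.randn(shape), torch.randn(shape))
print(out.shape)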


4. Real-Time and Infinite Extension

Thanks to the causal student and the attention sink, the model achieves approximately:

  • 17 FPS at 480p
  • 10 FPS at 720p
  • up to ~29 FPS with a Tiny VAE decoder
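
Expressed as a per-frame time budget, which is what makes interactive control feel immediate:

# Per-frame latency implied by the reported throughput figures.
for label, fps in [("480p", 17), ("720p", 10), ("Tiny VAE decoder", 29)]:
    print(f"{label}: {1000 / fps:.0f} ms per frame")
# 480p: 59 ms/frame, 720p: 100 ms/frame, Tiny VAE decoder: 34 ms/frame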

Conclusion

MotionStream represents a major step toward interactive video generation, enabling real-time motion control and immediate feedback.
Although the official code is not yet available, the paper demonstrates how the teacher-student architecture, based on Wan 2.1 and innovations such as sliding-window causal attention and attention sink, delivers remarkable improvements in speed and quality.

While awaiting the official release, you can experiment with the Wan2.1-T2V-1.3B model on Hugging Face using the setup and examples provided above.
