Stable Video Diffusion (SVD) 1.1 is Stability AI's openly available video generation model, which synthesizes short, high-quality video clips from a single conditioning image.
## Model Overview
### Key Capabilities

- Image-to-Video: Generate 2-4 second video clips from a single image
- Image-only conditioning: SVD does not accept text prompts; the input image alone drives generation
- Open weights: Full model weights available for local deployment
- Customizable: Fine-tune on custom datasets for specific styles
- Resolution: 1024x576 native, upscalable with external tools
### Model Variants

- SVD: Base model, 14 frames at 1024x576
- SVD-XT: Extended model, 25 frames at 1024x576
- SVD-XT 1.1: Improved quality, better motion coherence
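Since the variants expect different frame counts, a small lookup table can keep generation calls consistent. This is an illustrative sketch (the helper name and dictionary are mine; the frame counts and resolution come from the list above):

```python
# Hypothetical helper: map each SVD variant to its expected frame count
# and native resolution (width, height), per the variant list above.
SVD_VARIANTS = {
    "svd":        {"num_frames": 14, "resolution": (1024, 576)},
    "svd-xt":     {"num_frames": 25, "resolution": (1024, 576)},
    "svd-xt-1.1": {"num_frames": 25, "resolution": (1024, 576)},
}

def generation_settings(variant: str) -> dict:
    """Return the default num_frames/resolution for a known variant."""
    try:
        return SVD_VARIANTS[variant]
    except KeyError:
        raise ValueError(f"Unknown SVD variant: {variant!r}")

print(generation_settings("svd-xt")["num_frames"])  # 25
```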
## Local Setup
### Requirements

```bash
# Hardware requirements:
# - NVIDIA GPU with 12GB+ VRAM (RTX 3060 12GB minimum)
# - 32GB system RAM recommended
# - ~20GB disk space for model weights

# Install dependencies
pip install diffusers transformers accelerate torch
pip install opencv-python pillow
```

Download the model weights (the 1.1 checkpoint may require accepting the model license on Hugging Face and logging in with `huggingface-cli login` first):

```python
from huggingface_hub import snapshot_download

snapshot_download("stabilityai/stable-video-diffusion-img2vid-xt-1-1")
```
### Basic Image-to-Video Generation

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

# Load the pipeline in half precision to fit consumer GPUs
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Load and resize the conditioning image to the model's native resolution
image = Image.open("input_image.png").resize((1024, 576))

# Generate video frames
frames = pipe(
    image,
    num_frames=25,
    decode_chunk_size=8,       # decode the VAE in chunks to limit peak VRAM
    motion_bucket_id=127,
    noise_aug_strength=0.02,
).frames[0]

# Save the frames as an MP4 clip
export_to_video(frames, "output.mp4", fps=6)
```
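Clip length follows directly from the frame count and the export frame rate: 25 frames at 6 fps is a little over four seconds, which is where the 2-4 second figure above comes from. A tiny illustrative calculation (the helper name is mine):

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Duration of the exported clip: frame count divided by frame rate."""
    return num_frames / fps

print(round(clip_duration_seconds(25, 6), 2))  # 4.17 -> SVD-XT at 6 fps
print(round(clip_duration_seconds(14, 6), 2))  # 2.33 -> base SVD at 6 fps
```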
## Key Parameters

- `motion_bucket_id` (1-255): Controls the amount of motion (higher = more motion)
- `noise_aug_strength` (0-1): Noise added to the conditioning image; higher values loosen adherence to the input
- `num_frames`: Number of output frames (14 for SVD, 25 for XT)
- `decode_chunk_size`: Number of frames the VAE decodes at once; lower values reduce peak VRAM at the cost of speed
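Because `motion_bucket_id` must stay within 1-255, it can be convenient to expose motion to users as a 0.0-1.0 scale and map it into the valid range. A minimal sketch (hypothetical helper, not part of `diffusers`):

```python
def motion_to_bucket_id(motion: float) -> int:
    """Map a 0.0-1.0 motion scale onto the valid 1-255 bucket range."""
    motion = min(max(motion, 0.0), 1.0)  # clamp out-of-range input
    return 1 + round(motion * 254)

print(motion_to_bucket_id(0.0))  # 1   (nearly static)
print(motion_to_bucket_id(0.5))  # 128 (moderate motion, near the default 127)
print(motion_to_bucket_id(1.0))  # 255 (maximum motion)
```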