05. Video Generation

AI video generation is a technology that takes text, images, or existing video as input and creates new videos. While image generation produces a single still picture, video generation creates dozens to hundreds of consecutive frames while maintaining temporal consistency.

In this chapter, you will learn:

  • The main types of AI video generation (T2V, I2V, FLF2V, V2V)
  • The basic structure of the video pipeline and how it differs from the image pipeline
  • A model selection guide based on your purpose and environment

AI video generation is divided into several types based on the kind of input data.

The examples in this chapter use Wan 2.2 as the reference model, since it is the most representative of current video models.

Text-to-video (T2V) generates video from text prompts alone. When you describe a scene such as “A cat walking through a garden,” the model creates a moving video of it.

T2V Workflow

  • Offers the highest degree of freedom, but it can be difficult to precisely control the desired result
  • Most video models support T2V as a baseline

Image-to-video (I2V) takes a single still image as input and generates a video in which that image comes to life. For example, a person in a photo can start walking, or wind can blow through a landscape.

I2V Workflow

  • Provides high consistency since the starting frame can be precisely maintained

First-last-frame-to-video (FLF2V) takes two images, a start image and an end image, and generates a video that transitions naturally between them.

FLF2V Workflow

  • Allows precise control of both the start and end, making it ideal for storyboard-based work
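Because the types are distinguished by which inputs are supplied, the distinction can be sketched as a small dispatch function. This is only an illustration: the function name `infer_generation_type` and its parameters are hypothetical and do not belong to Wan 2.2 or any workflow tool.

```python
from typing import Optional


def infer_generation_type(
    prompt: Optional[str] = None,
    start_image: Optional[str] = None,
    end_image: Optional[str] = None,
    source_video: Optional[str] = None,
) -> str:
    """Return the generation type implied by which inputs are present.

    All parameters are hypothetical placeholders (for example, file paths).
    """
    # An existing video as input means video-to-video.
    if source_video is not None:
        return "V2V"
    # An end image only makes sense together with a start image.
    if end_image is not None and start_image is None:
        raise ValueError("An end image requires a start image (FLF2V)")
    # Both a start and an end image: first/last frame control.
    if start_image is not None and end_image is not None:
        return "FLF2V"
    # A single starting image: image-to-video.
    if start_image is not None:
        return "I2V"
    # Text only: text-to-video.
    if prompt:
        return "T2V"
    raise ValueError("At least one input (prompt, image, or video) is required")


if __name__ == "__main__":
    print(infer_generation_type(prompt="A cat walking through a garden"))
    print(infer_generation_type(prompt="wind blowing", start_image="photo.png"))
```

The checks run from the most constrained input (video) to the least constrained (text), so a call that supplies both a prompt and images is classified by its images.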

| Purpose | Recommended Model | Reason |
| --- | --- | --- |
| Getting started | Wan 2.1 T2V (1.3B) | Lightweight and fast, suitable for understanding the basic pipeline |
| General high quality | Wan 2.2 14B (Turbo) | Fast with 4-step turbo while maintaining 14B model quality |
| Image to Video | Wan 2.2 I2V | Stable I2V output |
| Video control (V2V) | Wan 2.2 Fun Control | Canny-based motion control, style conversion |
| First/last frame control | Wan 2.2 FLF2V | Precisely specify the start and end |

| VRAM | Recommended Model | Notes |
| --- | --- | --- |
| 8~12GB | Wan 2.1 T2V (1.3B) | Lightweight model, lower resolution/frame count |
| 12~16GB | Wan 2.2 5B, Wan 2.1 14B | Mid-range models, leverage fp8 quantization |
| 16~24GB | Wan 2.2 14B (Turbo) | Capable of running most video workflows |
| 24GB+ | All models | Can generate at high resolution with many frames |
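The VRAM guide above can also be expressed as a simple lookup. The following is a minimal sketch: the helper `recommend_by_vram` is hypothetical, and because the table's ranges share their endpoints (12, 16, and 24 GB), the sketch assumes each boundary value belongs to the higher tier.

```python
from typing import Optional

# (minimum VRAM in GB, recommendation) copied from the VRAM table,
# ordered from the largest tier to the smallest.
# Assumption: a boundary value such as 12 GB falls into the higher tier.
VRAM_TIERS = [
    (24, "All models"),
    (16, "Wan 2.2 14B (Turbo)"),
    (12, "Wan 2.2 5B, Wan 2.1 14B"),
    (8, "Wan 2.1 T2V (1.3B)"),
]


def recommend_by_vram(vram_gb: float) -> Optional[str]:
    """Return the table's recommendation for the given VRAM, or None below 8 GB."""
    for min_gb, recommendation in VRAM_TIERS:
        if vram_gb >= min_gb:
            return recommendation
    # The table does not list anything below 8 GB.
    return None


if __name__ == "__main__":
    for vram in (6, 8, 12, 16, 24):
        print(vram, "GB ->", recommend_by_vram(vram))
```

Returning `None` below 8 GB reflects only that the table has no entry for that range; it is not a claim that generation is impossible on such hardware.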

  • Control level increases in the order of T2V -> I2V -> V2V. If text alone is not enough, use image input; if you need even more precise control, use video input.
  • Wan 2.2 supports multiple video generation types, including T2V, I2V, FLF2V, and control-based V2V (Fun Control).
  • Video generation requires significantly more VRAM and time than image generation. Start with low resolution and a small number of frames.
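A rough calculation shows why resolution and frame count matter so much. The sketch below counts only the uncompressed pixel data of the output frames. It is not a measurement of actual VRAM use, which also depends on model weights, latent-space compression, and precision. The resolutions and frame counts are illustrative values, not settings taken from any specific model.

```python
def raw_video_bytes(
    width: int,
    height: int,
    num_frames: int,
    channels: int = 3,
    bytes_per_value: int = 1,
) -> int:
    """Size in bytes of uncompressed frames (8-bit RGB by default)."""
    return width * height * channels * bytes_per_value * num_frames


if __name__ == "__main__":
    # Illustrative settings only.
    single_image = raw_video_bytes(832, 480, num_frames=1)
    short_clip = raw_video_bytes(832, 480, num_frames=33)
    long_clip = raw_video_bytes(832, 480, num_frames=81)
    large_clip = raw_video_bytes(1280, 720, num_frames=81)

    for name, size in [
        ("single image", single_image),
        ("33 frames", short_clip),
        ("81 frames", long_clip),
        ("81 frames at 1280x720", large_clip),
    ]:
        print(f"{name}: {size / 1e6:.1f} MB, {size / single_image:.1f}x one image")
```

The amount of data grows linearly with the frame count and with the pixel count per frame, which is why lowering both is the quickest way to get a first result.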