01. Image Generation (Text-to-Image)
What This Chapter Covers
Text-to-Image (T2I) is the most fundamental AI image generation method, creating images from text alone. When you describe a desired scene in words, such as “A cat sitting on a rainbow,” the AI model interprets the description and generates an image.
In this chapter, you will learn:
- The basic workflow structure for Text-to-Image
- Key features of major image models like Flux, SDXL, and Qwen-Image
- A model selection guide based on your purpose and environment
Understanding the Basic Pipeline
All Text-to-Image workflows share the five stages below. While the specific node names and settings differ by model, the overall flow remains the same.
Step-by-Step Explanation
| Step | What It Does | Related Nodes (Examples) |
|---|---|---|
| 1. Load Model | Select and load the model needed for image generation | CheckpointLoaderSimple, UNETLoader, CLIPLoader, VAELoader |
| 2. Text Encoding | Convert the text describing the desired image into a format the model can understand | CLIPTextEncode, CLIPTextEncodeFlux |
| 3. Empty Latent Image | Set the width and height for image generation | EmptyLatentImage, EmptySD3LatentImage |
| 4. Sampling | Iteratively denoise the latent, guided by the encoded prompt, to generate the image | KSampler |
| 5. Decoding/Saving | Decode the latent into a pixel image with the VAE and save it | VAEDecode, SaveImage |
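The five stages above map directly onto a ComfyUI node graph. The sketch below writes that graph as a Python dictionary in ComfyUI's API (JSON) workflow format, where each node has a `class_type` and `inputs`, and a link to another node is written as `[source_node_id, output_index]`. The checkpoint filename, the `build_t2i_workflow` helper, and the sampler values are placeholders for illustration, not recommended settings.

```python
import json


def build_t2i_workflow(ckpt_name, positive, negative="", width=1024, height=1024,
                       steps=20, cfg=7.0, seed=0):
    """Return a minimal text-to-image graph in ComfyUI's API workflow format."""
    return {
        # 1. Load Model: outputs are MODEL (0), CLIP (1), VAE (2)
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},
        # 2. Text Encoding: one node each for the positive and negative prompt
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": positive, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["1", 1]}},
        # 3. Empty Latent Image: fixes the output width and height
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        # 4. Sampling
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        # 5. Decoding / Saving
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "t2i"}},
    }


# "example_checkpoint.safetensors" is a hypothetical filename.
workflow = build_t2i_workflow("example_checkpoint.safetensors",
                              "A cat sitting on a rainbow")
print(json.dumps(workflow, indent=2))
```

Models that use separate loading swap node 1 for individual loader nodes, but the rest of the wiring keeps the same shape.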
Image Model Loading Methods: Unified vs. Separate
There are two main approaches to loading models.
Unified Loading (CheckpointLoaderSimple)
A single file (.safetensors) contains UNET + CLIP + VAE all together, so a single node loads all components at once. Flux Schnell fp8, SDXL, SD3.5, and others use this method.
Separate Loading (UNETLoader/Load diffusion model + CLIPLoader + VAELoader)
Each component is loaded individually from its own file. This lets you freely swap model combinations and choose the precision you need (fp16, fp8, etc.). Flux, Qwen, Z-Image-Turbo (ZIT), and other recent models use this method.
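In practice, the two approaches differ mainly in which node supplies the MODEL, CLIP, and VAE links to the rest of the graph. The following is a minimal sketch in ComfyUI's API workflow format. The helper names and all filenames are hypothetical, and the `type` and `weight_dtype` values are assumptions that depend on the model and your ComfyUI version.

```python
def unified_loader(ckpt_name):
    """One node supplies MODEL (0), CLIP (1), and VAE (2)."""
    nodes = {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},
    }
    links = {"model": ["1", 0], "clip": ["1", 1], "vae": ["1", 2]}
    return nodes, links


def separate_loaders(unet_name, clip_name, vae_name, clip_type,
                     weight_dtype="default"):
    """Three nodes, each supplying one component on output 0."""
    nodes = {
        "1": {"class_type": "UNETLoader",
              "inputs": {"unet_name": unet_name, "weight_dtype": weight_dtype}},
        "2": {"class_type": "CLIPLoader",
              "inputs": {"clip_name": clip_name, "type": clip_type}},
        "3": {"class_type": "VAELoader",
              "inputs": {"vae_name": vae_name}},
    }
    links = {"model": ["1", 0], "clip": ["2", 0], "vae": ["3", 0]}
    return nodes, links


# Hypothetical filenames; "qwen_image" is an assumed CLIPLoader type value.
nodes, links = separate_loaders("diffusion_model.safetensors",
                                "text_encoder.safetensors",
                                "vae.safetensors",
                                clip_type="qwen_image")
print(links)
```

Downstream nodes only need the `links` dictionary, so swapping one text encoder or VAE file for another leaves the rest of the workflow untouched.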
Sampler Method: KSampler
| Method | Features |
|---|---|
| KSampler | Configures steps, cfg, sampler, scheduler all in one node. Simple and intuitive |
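As a rough illustration, the settings that KSampler exposes can be grouped in one small structure. The field names below follow the KSampler node's inputs. The `SamplerSettings` class itself is a hypothetical helper, and the concrete values are placeholders rather than recommendations for any particular model.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SamplerSettings:
    """The scalar inputs of a KSampler node, gathered in one place."""
    steps: int
    cfg: float
    sampler_name: str = "euler"
    scheduler: str = "normal"
    seed: int = 0
    denoise: float = 1.0

    def __post_init__(self):
        # Basic sanity checks before the values reach a workflow.
        if self.steps < 1:
            raise ValueError("steps must be at least 1")
        if not 0.0 <= self.denoise <= 1.0:
            raise ValueError("denoise must be between 0.0 and 1.0")

    def to_inputs(self):
        """Return the settings as a dict that can be merged into KSampler inputs."""
        return asdict(self)


settings = SamplerSettings(steps=8, cfg=1.0)  # placeholder values
print(settings.to_inputs())
```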
Prompt Formats
Using a scene of someone working on a laptop at a Seoul cafe in spring as an example:
Sentence-style: A candid, photo-realistic scene inside a cozy Seoul cafe in spring: a person working on a laptop by a window with soft morning sunlight, cherry blossoms faintly visible outside, shallow depth of field, warm natural tones, 35mm lens look, high detail, no text, no logos
Tag-style: photorealistic, candid, Seoul cafe, spring, laptop, window seat, soft morning light, cherry blossoms outside, shallow depth of field, warm tones, 35mm, high detail, cozy atmosphere, no text, no logos
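If you assemble prompts programmatically, the two formats can be produced from the same ingredients. The sketch below is one possible approach. `tag_prompt` and `sentence_prompt` are hypothetical helper names, and the "subject: details" layout simply mirrors the sentence-style example above.

```python
def tag_prompt(tags):
    """Join tags into a comma-separated prompt, dropping blanks and repeats."""
    seen = set()
    kept = []
    for tag in tags:
        tag = tag.strip()
        key = tag.lower()  # compare case-insensitively
        if tag and key not in seen:
            seen.add(key)
            kept.append(tag)
    return ", ".join(kept)


def sentence_prompt(subject, details):
    """Build a sentence-style prompt: a subject clause followed by details."""
    cleaned = [d.strip() for d in details if d.strip()]
    if not cleaned:
        return subject.strip()
    return f"{subject.strip()}: {', '.join(cleaned)}"


print(tag_prompt(["photorealistic", "Seoul cafe", "spring", "laptop"]))
print(sentence_prompt("A candid, photo-realistic scene inside a cozy Seoul cafe in spring",
                      ["a person working on a laptop by a window",
                       "soft morning sunlight"]))
```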
Model Overview
Flux Series
Sentence-style prompts are recommended for these models (tag-style prompts also work but may give weaker results).
| Model | Speed | Quality |
|---|---|---|
| Flux.1 Schnell | Very Fast | Average |
| Flux.1 Dev | Average | Good |
| Flux.2 Dev | Average-Fast | Excellent |
| Flux.2 Klein 4B | Very Fast | Good |
| Flux.2 Klein 9B | Fast | Excellent |
| Flux.1 Krea Dev | Average | Good |
SDXL Series
Tag-style prompts are recommended for these models.
| Model | Speed | Quality |
|---|---|---|
| SDXL | Average | Excellent |
| SDXL Turbo | Very Fast | Average |
Qwen Image Series
Qwen-Image is a model series with excellent multilingual prompt support. It uses a text encoder based on Qwen 2.5 VL, allowing you to write prompts in various languages including Korean, Chinese, and English.
| Model | Speed | Quality |
|---|---|---|
| Qwen-Image 20B | Fast | High |
| Qwen-Image 2512 | Flexible | Highest |
| Qwen-Image 2512 Turbo | Very Fast | Highest |
| Qwen-Image-Edit-2509 | Flexible | High |
| Qwen-Image-Edit-2512 | Flexible | Highest |
Other Models
Z-Image-Turbo
- Uses Qwen 3 4B text encoder
- Fast generation in 8 steps
- Optimized for photorealistic images
Z-Image-Base
- Uses Qwen 3 4B text encoder
- Optimized for photorealistic images
Which Model Should You Choose?
Recommendations by Purpose
If you are having trouble choosing a model, refer to the table below.
| Purpose | Recommended Model | Reason |
|---|---|---|
| Getting started | Z-Image-Turbo | 8 steps, fast; the default model in Getting Started |
| Fast generation / Low-spec / Lightweight | Flux.1 Schnell | 4 steps, fast speed, Apache 2.0 |
| General-purpose high quality | Flux.1 Dev, Z-Image-Turbo | Balanced quality and speed |
| Photorealistic photos | Flux.1 Krea Dev | Fine-tuned for realistic photos |
| Text rendering | Qwen-Image 2512, Flux.2 Dev | Accurately generates text within images |
| Anime / Illustration | SDXL (Illustrious-XL, NoobAI-XL) | Specialized for anime styles |
| Korean-language prompts | Qwen-Image series | Supports Korean, Chinese, English, and more |
Recommended Learning Path for Beginners
If you are starting Text-to-Image with ComfyUI for the first time, we recommend the following order:
- Start with Z-Image-Turbo to understand the basic pipeline
- Move on to Flux.1 Dev or Flux.2 Dev for high-quality generation
  - Generate higher-quality results with more refined prompts
- Expand to specialized models based on your needs
  - If you need anime-style output, try SDXL
  - If you want to try Korean-language prompts, try Qwen-Image
Key Takeaways
Things to Remember
- All Text-to-Image workflows follow the same flow: Load Model -> Text Encoding -> Empty Latent Image -> Sampling -> Decoding -> Save.
- The optimal steps, CFG, sampler, and scheduler values differ by model. Start with the template defaults and adjust gradually.
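One way to adjust gradually is to change a single setting at a time while keeping the seed fixed, so that any difference between outputs can be attributed to that setting. A minimal sketch follows. The `one_at_a_time` helper is hypothetical, and the default values are placeholders standing in for whatever your workflow template ships with.

```python
def one_at_a_time(defaults, candidates):
    """Yield settings dicts that differ from `defaults` in exactly one key."""
    for key, values in candidates.items():
        if key not in defaults:
            raise KeyError(f"unknown setting: {key}")
        for value in values:
            if value != defaults[key]:  # skip the unchanged baseline
                yield {**defaults, key: value}


# Placeholder defaults; replace them with the values from your template.
template_defaults = {"steps": 8, "cfg": 1.0, "sampler_name": "euler",
                     "scheduler": "normal", "seed": 42}

for variant in one_at_a_time(template_defaults,
                             {"steps": [6, 8, 10], "cfg": [1.0, 1.5]}):
    print(variant)
```

Each variant can be merged into the KSampler inputs of an otherwise identical workflow, and the resulting images compared side by side.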