01. Image Generation (Text-to-Image)

What This Chapter Covers

Text-to-Image (T2I) is the most fundamental AI image generation method, creating images from text alone. When you describe a desired scene in words, such as “A cat sitting on a rainbow,” the AI model interprets the description and generates an image.

In this chapter, you will learn:

The basic workflow structure for Text-to-Image
Key features of major image models like Flux, SDXL, and Qwen-Image
A model selection guide based on your purpose and environment

Understanding the Basic Pipeline

All Text-to-Image workflows share the five stages below. While the specific node names and settings differ by model, the overall flow remains the same.

Step-by-Step Explanation

Step	What It Does	Related Nodes (Examples)
1. Load Model	Select and load the model needed for image generation	`CheckpointLoaderSimple`, `UNETLoader`, `CLIPLoader`, `VAELoader`
2. Text Encoding	Convert the text describing the desired image into a format the model can understand	`CLIPTextEncode`, `CLIPTextEncodeFlux`
3. Empty Latent Image	Create a blank latent at the width and height of the image to generate	`EmptyLatentImage`, `EmptySD3LatentImage`
4. Sampling	Iteratively denoise the latent, guided by the encoded prompt, to generate the image	`KSampler`
5. Decoding/Saving	Decode the latent into a viewable image and save it	`VAEDecode`, `SaveImage`
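
The five stages above can be sketched as a node graph in ComfyUI's API-style JSON layout, written here as a Python dictionary. This is a minimal illustration, not a tuned workflow: the helper name `build_t2i_workflow`, the checkpoint file name, and the default values for `steps`, `cfg`, `sampler_name`, and `scheduler` are placeholders chosen for the example. Each link has the form `[node_id, output_index]`.

```python
import json


def build_t2i_workflow(ckpt_name, prompt, negative="", width=1024, height=1024,
                       seed=0, steps=20, cfg=7.0,
                       sampler_name="euler", scheduler="normal"):
    """Return an API-style graph covering the five pipeline stages.

    All default values are placeholders; use the defaults of the template
    for the model you actually load.
    """
    return {
        # 1. Load Model: outputs are MODEL (0), CLIP (1), VAE (2)
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},
        # 2. Text Encoding: positive and negative prompts share the CLIP output
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["1", 1]}},
        # 3. Empty Latent Image: sets the output size
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        # 4. Sampling: denoise the latent, guided by the encoded prompts
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0],
                         "positive": ["2", 0],
                         "negative": ["3", 0],
                         "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": sampler_name,
                         "scheduler": scheduler,
                         "denoise": 1.0}},
        # 5. Decoding/Saving: latent -> pixels -> file
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "t2i"}},
    }


# Hypothetical file name, used only for illustration.
workflow = build_t2i_workflow("example_checkpoint.safetensors",
                              "A cat sitting on a rainbow")
print(json.dumps(workflow["5"], indent=2))
```

A locally running ComfyUI server typically accepts a graph like this as the `prompt` field of a JSON body posted to its `/prompt` endpoint. The sketch stops short of sending it so that it runs without a server.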

Image Model Loading Methods: Unified vs. Separate

There are two main approaches to loading models.

Unified Loading (CheckpointLoaderSimple): A single file (.safetensors) contains the UNET, CLIP, and VAE together, so one node loads all components at once. Flux Schnell fp8, SDXL, SD3.5, and others use this method.

Separate Loading (UNETLoader/Load diffusion model + CLIPLoader + VAELoader): Each component is loaded individually from its own file. This lets you freely swap model combinations and choose the precision you need (fp16, fp8, etc.). Flux, Qwen-Image, Z-Image-Turbo, and other recent models use this method.
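
To make the difference concrete, the sketch below builds both loader variants in API-style form and shows where downstream nodes would pick up the model, text encoder, and VAE. The helper names and all file names are hypothetical, and the `clip_type` and `weight_dtype` values shown are examples only; the accepted values depend on the ComfyUI version and the model you install.

```python
def unified_loader(ckpt_name):
    """One node, one file: MODEL, CLIP, and VAE are outputs 0, 1, and 2."""
    nodes = {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},
    }
    links = {"model": ["1", 0], "clip": ["1", 1], "vae": ["1", 2]}
    return nodes, links


def separate_loaders(unet_name, clip_name, vae_name, clip_type,
                     weight_dtype="default"):
    """Three nodes, three files: each component is output 0 of its own loader."""
    nodes = {
        "1": {"class_type": "UNETLoader",
              "inputs": {"unet_name": unet_name, "weight_dtype": weight_dtype}},
        "2": {"class_type": "CLIPLoader",
              "inputs": {"clip_name": clip_name, "type": clip_type}},
        "3": {"class_type": "VAELoader",
              "inputs": {"vae_name": vae_name}},
    }
    links = {"model": ["1", 0], "clip": ["2", 0], "vae": ["3", 0]}
    return nodes, links


# Hypothetical file names; the type and dtype strings are example values.
_, unified_links = unified_loader("example_checkpoint.safetensors")
_, separate_links = separate_loaders(
    "example_diffusion_model.safetensors",
    "example_text_encoder.safetensors",
    "example_vae.safetensors",
    clip_type="qwen_image",
    weight_dtype="fp8_e4m3fn",
)
print(unified_links)
print(separate_links)
```

With separate loading, swapping the text encoder or lowering precision means changing one loader's inputs, while the rest of the graph keeps the same links.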

Sampler Method: KSampler

Method	Features
KSampler	Sets steps, cfg, sampler, and scheduler in a single node; simple and intuitive
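
For intuition about the `cfg` setting: in classifier-free guidance, each sampling step combines a prediction made with the prompt and one made without it (or with the negative prompt), and `cfg` scales how far the result is pushed toward the prompt. The toy sketch below shows that combination on plain Python lists. It is an illustration of the general formula, not ComfyUI's implementation, and `apply_cfg` is a name made up for this example.

```python
def apply_cfg(cond, uncond, cfg):
    """Classifier-free guidance: uncond + cfg * (cond - uncond), per element."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


# Toy numbers standing in for the model's two predictions at one step.
cond = [1.0, 2.0]     # prediction guided by the prompt
uncond = [0.0, 1.0]   # prediction without the prompt

print(apply_cfg(cond, uncond, 1.0))  # [1.0, 2.0]: cfg of 1 returns the prompt-guided prediction
print(apply_cfg(cond, uncond, 3.0))  # [3.0, 4.0]: higher cfg pushes further toward the prompt
```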

Prompt Formats

Prompts are usually written in one of two formats. Here is the same scene, a person working on a laptop in a Seoul cafe in spring, written both ways:

Sentence-style: A candid, photo-realistic scene inside a cozy Seoul cafe in spring: a person working on a laptop by a window with soft morning sunlight, cherry blossoms faintly visible outside, shallow depth of field, warm natural tones, 35mm lens look, high detail, no text, no logos

Tag-style: photorealistic, candid, Seoul cafe, spring, laptop, window seat, soft morning light, cherry blossoms outside, shallow depth of field, warm tones, 35mm, high detail, cozy atmosphere, no text, no logos
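
If you keep tags in a list, a small helper can assemble the tag-style string before it goes into a text-encoding node. This is a sketch for convenience only; `tag_prompt` is a hypothetical helper, not part of ComfyUI.

```python
def tag_prompt(tags):
    """Join tags into a comma-separated prompt, dropping blanks and repeats."""
    seen = set()
    kept = []
    for tag in tags:
        tag = tag.strip()
        key = tag.lower()
        if tag and key not in seen:
            seen.add(key)
            kept.append(tag)
    return ", ".join(kept)


tags = ["photorealistic", "candid", "Seoul cafe", "spring", "laptop",
        "window seat", "soft morning light", "candid", ""]
prompt = tag_prompt(tags)
print(prompt)
# photorealistic, candid, Seoul cafe, spring, laptop, window seat, soft morning light
```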

Model Overview

Flux Series

Sentence-style prompts are recommended for these models (tag-style prompts also work but may produce lower-quality results).

Model	Speed	Quality
Flux.1 Schnell	Very Fast	Average
Flux.1 Dev	Average	Good
Flux.2 Dev	Average-Fast	Excellent
Flux.2 Klein 4B	Very Fast	Good
Flux.2 Klein 9B	Fast	Excellent
Flux.1 Krea Dev	Average	Good

SDXL Series

Tag-style prompts are recommended for these models.

Model	Speed	Quality
SDXL	Average	Excellent
SDXL Turbo	Very Fast	Average

Qwen Image Series

Qwen-Image is a model series with excellent multilingual prompt support. It uses a text encoder based on Qwen 2.5 VL, allowing you to write prompts in various languages including Korean, Chinese, and English.

Model	Speed	Quality
Qwen-Image 20B	Fast	High
Qwen-Image 2512	Flexible	Highest
Qwen-Image 2512 Turbo	Very Fast	Highest
Qwen-Image-Edit-2509	Flexible	High
Qwen-Image-Edit-2512	Flexible	Highest

Other Models

Z-Image-Turbo

Uses Qwen 3 4B text encoder
Fast generation in 8 steps
Optimized for photorealistic images

Z-Image-Base

Uses Qwen 3 4B text encoder
Optimized for photorealistic images

Which Model Should You Choose?

Recommendations by Purpose

If you are having trouble choosing a model, refer to the table below.

Purpose	Recommended Model	Reason
Getting started	Z-Image-Turbo	8 steps, fast; the default model in Getting Started
Fast generation / Low-spec / Lightweight	Flux.1 Schnell	4 steps, fast; Apache 2.0 license
General-purpose high quality	Flux.1 Dev, Z-Image-Turbo	Balanced quality and speed
Photorealistic photos	Flux.1 Krea Dev	Fine-tuned for realistic photos
Text rendering	Qwen-Image 2512, Flux.2 Dev	Accurately generates text within images
Anime / Illustration	SDXL (Illustrious-XL, NoobAI-XL)	Specialized for anime styles
Korean-language prompts	Qwen-Image series	Supports Korean, Chinese, English, and more

Recommended Learning Path for Beginners

If you are starting Text-to-Image with ComfyUI for the first time, we recommend the following order:

1. Z-Image-Turbo to understand the basic pipeline
- https://nordy.ai/comfyui/?flow=69673b4b1b32b12f62275edc
2. Flux.1 Dev or Flux.2 Dev for high-quality generation
- Generate higher-quality results with more refined prompts
3. Expand to specialized models based on your needs
- If you need anime-style output, try SDXL
- If you want to try Korean-language prompts, try Qwen-Image

Key Takeaways

Things to Remember

All Text-to-Image workflows follow the same flow: Load Model -> Text Encoding -> Empty Latent Image -> Sampling -> Decoding -> Save.
The optimal steps, CFG, sampler, and scheduler values differ by model. Start with the template defaults and adjust gradually.
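
One way to adjust gradually is to keep the seed fixed and vary only `steps` and `cfg`, so that differences between outputs come from the settings rather than from the random noise. The sketch below generates such variants from an API-style graph. The helper name, the node id `"5"`, and the numeric values are placeholders, not recommended settings.

```python
import copy
import itertools


def sweep_sampler_settings(workflow, sampler_node_id, steps_values, cfg_values):
    """Return one copy of the workflow per (steps, cfg) pair; the seed stays unchanged."""
    variants = []
    for steps, cfg in itertools.product(steps_values, cfg_values):
        variant = copy.deepcopy(workflow)  # leave the original untouched
        variant[sampler_node_id]["inputs"].update({"steps": steps, "cfg": cfg})
        variants.append(variant)
    return variants


# Minimal stand-in for a full graph: only the sampler node is shown.
base = {"5": {"class_type": "KSampler",
              "inputs": {"seed": 0, "steps": 8, "cfg": 1.0}}}

variants = sweep_sampler_settings(base, "5", [6, 8, 10], [1.0, 1.5])
for v in variants:
    print(v["5"]["inputs"]["steps"], v["5"]["inputs"]["cfg"])
```

Queue each variant in turn and compare the saved images side by side before changing the sampler or scheduler.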