
01. Image Generation (Text-to-Image)

Text-to-Image (T2I) is the most fundamental AI image generation method, creating images from text alone. When you describe a desired scene in words, such as “A cat sitting on a rainbow,” the AI model interprets the description and generates an image.

In this chapter, you will learn:

  • The basic workflow structure for Text-to-Image
  • Key features of major image models like Flux, SDXL, and Qwen-Image
  • A model selection guide based on your purpose and environment

All Text-to-Image workflows share the five stages below. While the specific node names and settings differ by model, the overall flow remains the same.

| Step | What It Does | Related Nodes (Examples) |
| --- | --- | --- |
| 1. Load Model | Select and load the model needed for image generation | CheckpointLoaderSimple, UNETLoader, CLIPLoader, VAELoader |
| 2. Text Encoding | Convert the text describing the desired image into a format the model can understand | CLIPTextEncode, CLIPTextEncodeFlux |
| 3. Empty Latent Image | Set the width and height for image generation | EmptyLatentImage, EmptySD3LatentImage |
| 4. Sampling | Generate the image by denoising the latent step by step | KSampler |
| 5. Decoding/Saving | Decode the latent into an image and save it | VAEDecode, SaveImage |
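The five stages can be sketched as a graph in ComfyUI's API (JSON) workflow format, where each node has a `class_type` and `inputs`, and a link is written as `[source_node_id, output_index]`. This is a minimal illustration rather than an official template: the checkpoint file name, the node ids, and the `steps`, `cfg`, sampler, and scheduler values are placeholder assumptions, and `dangling_links` is a hypothetical helper added only to sanity-check the wiring.

```python
import json

# Hypothetical file name; use a checkpoint that exists in your models/checkpoints folder.
CKPT_NAME = "example_checkpoint.safetensors"


def build_t2i_workflow(prompt, negative="", width=1024, height=1024,
                       seed=0, steps=20, cfg=7.0):
    """Return a five-stage Text-to-Image graph as a plain dict.

    steps/cfg/sampler/scheduler are placeholders, not recommended values.
    """
    return {
        # 1. Load Model: outputs are MODEL (0), CLIP (1), VAE (2)
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": CKPT_NAME}},
        # 2. Text Encoding: positive and negative prompts
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["1", 1]}},
        # 3. Empty Latent Image: sets the output size
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        # 4. Sampling
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        # 5. Decoding/Saving
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "t2i_example"}},
    }


def dangling_links(workflow):
    """List (node_id, input_name) pairs that point at a missing node id."""
    missing = []
    for node_id, node in workflow.items():
        for name, value in node["inputs"].items():
            if isinstance(value, list) and len(value) == 2 and value[0] not in workflow:
                missing.append((node_id, name))
    return missing


workflow = build_t2i_workflow("A cat sitting on a rainbow")
print(json.dumps(workflow["5"], indent=2))
```

Swapping the model only changes the loader and encoder nodes; the order of the stages stays the same.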

Image Model Loading Methods: Unified vs. Separate

There are two main approaches to loading models.

Unified Loading (CheckpointLoaderSimple): A single file (.safetensors) contains the UNET, CLIP, and VAE together, so one node loads all components at once. Flux Schnell fp8, SDXL, SD3.5, and others use this method.

Separate Loading (UNETLoader/Load Diffusion Model + CLIPLoader + VAELoader): Each component is loaded individually from its own file. This lets you freely swap model combinations and choose the precision you need (fp16, fp8, etc.). Flux, Qwen, Z-Image-Turbo, and other recent models use this method.
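The difference between the two approaches can be sketched in the same API-format style. The helper functions, node ids, and file names below are hypothetical; the input names (`unet_name`, `weight_dtype`, `clip_name`, `type`, `vae_name`) and the `"qwen_image"` and `"fp8_e4m3fn"` values reflect my understanding of current ComfyUI loader nodes and should be checked against your installed version.

```python
def unified_loader(ckpt_name):
    """One node supplies everything: MODEL (0), CLIP (1), VAE (2)."""
    nodes = {
        "ckpt": {"class_type": "CheckpointLoaderSimple",
                 "inputs": {"ckpt_name": ckpt_name}},
    }
    refs = {"model": ["ckpt", 0], "clip": ["ckpt", 1], "vae": ["ckpt", 2]}
    return nodes, refs


def separate_loaders(unet_name, clip_name, vae_name, clip_type,
                     weight_dtype="default"):
    """Three nodes, one per component; each exposes a single output (0)."""
    nodes = {
        "unet": {"class_type": "UNETLoader",
                 "inputs": {"unet_name": unet_name, "weight_dtype": weight_dtype}},
        "clip": {"class_type": "CLIPLoader",
                 "inputs": {"clip_name": clip_name, "type": clip_type}},
        "vae": {"class_type": "VAELoader",
                "inputs": {"vae_name": vae_name}},
    }
    refs = {"model": ["unet", 0], "clip": ["clip", 0], "vae": ["vae", 0]}
    return nodes, refs


# Hypothetical file names, for illustration only.
u_nodes, u_refs = unified_loader("example_checkpoint.safetensors")
s_nodes, s_refs = separate_loaders(
    unet_name="example_diffusion_model.safetensors",
    clip_name="example_text_encoder.safetensors",
    vae_name="example_vae.safetensors",
    clip_type="qwen_image",      # assumed value; depends on the model family
    weight_dtype="fp8_e4m3fn",   # assumed value; "default" keeps the file's precision
)
print(sorted(u_refs), sorted(s_refs))
```

Both helpers return the same `model`, `clip`, and `vae` references, so the encoding, sampling, and decoding nodes downstream can be wired identically in either case.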

| Method | Features |
| --- | --- |
| KSampler | Configures steps, cfg, sampler, and scheduler all in one node. Simple and intuitive |

Using a scene of someone working on a laptop at a Seoul cafe in spring as an example:

Sentence-style: A candid, photo-realistic scene inside a cozy Seoul cafe in spring: a person working on a laptop by a window with soft morning sunlight, cherry blossoms faintly visible outside, shallow depth of field, warm natural tones, 35mm lens look, high detail, no text, no logos

Tag-style: photorealistic, candid, Seoul cafe, spring, laptop, window seat, soft morning light, cherry blossoms outside, shallow depth of field, warm tones, 35mm, high detail, cozy atmosphere, no text, no logos
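When prompts are assembled programmatically, a tag-style prompt is just a comma-separated list. The `tag_prompt` helper below is a hypothetical convenience function, not part of ComfyUI; it trims whitespace and drops blank or repeated tags before joining them.

```python
def tag_prompt(tags):
    """Join tags with ", ", skipping blanks and case-insensitive duplicates."""
    seen = set()
    kept = []
    for tag in tags:
        tag = tag.strip()
        key = tag.lower()
        if tag and key not in seen:
            seen.add(key)
            kept.append(tag)
    return ", ".join(kept)


prompt = tag_prompt([
    "photorealistic", "candid", "Seoul cafe", "spring", "laptop",
    "window seat", "soft morning light", "Photorealistic", "",
])
print(prompt)
```

A sentence-style prompt needs no such helper; it is written directly as natural-language text.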


Sentence-style prompts are recommended for the Flux models listed below. Tag-style prompts also work, but may produce weaker results.

| Model | Speed | Quality |
| --- | --- | --- |
| Flux.1 Schnell | Very Fast | Average |
| Flux.1 Dev | Average | Good |
| Flux.2 Dev | Average-Fast | Excellent |
| Flux.2 Klein 4B | Very Fast | Good |
| Flux.2 Klein 9B | Fast | Excellent |
| Flux.1 Krea Dev | Average | Good |

The SDXL models listed below support tag-style prompts only.

| Model | Speed | Quality |
| --- | --- | --- |
| SDXL | Average | Excellent |
| SDXL Turbo | Very Fast | Average |

Qwen-Image is a model series with excellent multilingual prompt support. It uses a text encoder based on Qwen 2.5 VL, allowing you to write prompts in various languages including Korean, Chinese, and English.

| Model | Speed | Quality |
| --- | --- | --- |
| Qwen-Image 20B | Fast | High |
| Qwen-Image 2512 | Flexible | Highest |
| Qwen-Image 2512 Turbo | Very Fast | Highest |
| Qwen-Image-Edit-2509 | Flexible | High |
| Qwen-Image-Edit-2512 | Flexible | Highest |

Z-Image-Turbo

  • Uses Qwen 3 4B text encoder
  • Fast generation in 8 steps
  • Optimized for photorealistic images

Z-Image-Base

  • Uses Qwen 3 4B text encoder
  • Optimized for photorealistic images

If you are having trouble choosing a model, refer to the table below.

| Purpose | Recommended Model | Reason |
| --- | --- | --- |
| Getting started | Z-Image-Turbo | 8 steps, fast, and the default model in the Getting Started guide |
| Fast generation / Low-spec / Lightweight | Flux.1 Schnell | 4 steps, fast, Apache 2.0 license |
| General-purpose high quality | Flux.1 Dev, Z-Image-Turbo | Balanced quality and speed |
| Photorealistic photos | Flux.1 Krea Dev | Fine-tuned for realistic photos |
| Text rendering | Qwen-Image 2512, Flux.2 Dev | Accurately generates text within images |
| Anime / Illustration | SDXL (Illustrious-XL, NoobAI-XL) | Specialized for anime styles |
| Korean-language prompts | Qwen-Image series | Supports Korean, Chinese, English, and more |
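For scripts that pick a model automatically, the selection guide can be expressed as a simple lookup. The dictionary mirrors the table; the `recommend` function and its fallback to the "getting started" entry for unknown purposes are illustrative assumptions, not part of any tool.

```python
# Purpose -> recommended models, copied from the selection table.
RECOMMENDATIONS = {
    "getting started": ["Z-Image-Turbo"],
    "fast generation": ["Flux.1 Schnell"],
    "general-purpose high quality": ["Flux.1 Dev", "Z-Image-Turbo"],
    "photorealistic photos": ["Flux.1 Krea Dev"],
    "text rendering": ["Qwen-Image 2512", "Flux.2 Dev"],
    "anime / illustration": ["SDXL (Illustrious-XL, NoobAI-XL)"],
    "korean-language prompts": ["Qwen-Image series"],
}


def recommend(purpose):
    """Return the recommended models for a purpose (case-insensitive).

    Unknown purposes fall back to the beginner default; this fallback is an
    assumption made for the sketch.
    """
    key = purpose.strip().lower()
    return RECOMMENDATIONS.get(key, RECOMMENDATIONS["getting started"])


print(recommend("Text rendering"))
```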

If you are starting Text-to-Image with ComfyUI for the first time, we recommend the following order:

  1. Z-Image-Turbo to understand the basic pipeline
  2. Flux.1 Dev or Flux.2 Dev for high-quality generation
    • Generate higher-quality results with more refined prompts
  3. Expand to specialized models based on your needs
    • If you need anime-style output, try SDXL
    • If you want to try Korean-language prompts, try Qwen-Image

  • All Text-to-Image workflows follow the same flow: Load Model -> Text Encoding -> Empty Latent Image -> Sampling -> Decoding/Saving.
  • The optimal steps, CFG, sampler, and scheduler values differ by model. Start with the template defaults and adjust gradually.
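The "start from defaults, then adjust" advice can be sketched as a small settings helper. Only the step counts for Z-Image-Turbo (8) and Flux.1 Schnell (4) come from this chapter; the `TEMPLATE_KSAMPLER` values are placeholders standing in for whatever your workflow template ships with, and `sampler_settings` is a hypothetical function.

```python
# Placeholder defaults; in practice, read these from the workflow template you loaded.
TEMPLATE_KSAMPLER = {
    "steps": 20,
    "cfg": 7.0,
    "sampler_name": "euler",
    "scheduler": "normal",
}

# Step counts stated in this chapter.
STEP_OVERRIDES = {"Z-Image-Turbo": 8, "Flux.1 Schnell": 4}


def sampler_settings(model_name, **tweaks):
    """Copy the template defaults, apply a known step count, then user tweaks."""
    settings = dict(TEMPLATE_KSAMPLER)  # copy so the template is never mutated
    if model_name in STEP_OVERRIDES:
        settings["steps"] = STEP_OVERRIDES[model_name]
    unknown = set(tweaks) - set(settings)
    if unknown:
        raise KeyError(f"unknown KSampler settings: {sorted(unknown)}")
    settings.update(tweaks)
    return settings


print(sampler_settings("Z-Image-Turbo"))
print(sampler_settings("Flux.1 Schnell", cfg=1.0))
```

Changing one value at a time through `tweaks` makes it easier to see which setting caused a difference in the output.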