Understanding AI Image Generation: Diffusion Models, CLIP, and Control Techniques
This guide explains how AI image generators such as Stable Diffusion and DALL·E 3 turn text prompts into pictures. It covers diffusion models, CLIP-aligned embeddings, and optional controls such as negative prompts, fine-tuned LoRA models, and ControlNet conditioning, and compares the tools' strengths, workflow, and practical customization.
With concepts such as “AI image creation” and “AI cover” appearing everywhere, many wonder how AI can generate pictures according to textual prompts. This article provides a plain‑language overview of the main principles and workflow, targeting readers who have no prior experience with AI image generation.
Quick comparison of mainstream AI image tools
The current mainstream tools are Stable Diffusion (SD), Midjourney, and DALL·E 3. Setting the less representative Midjourney aside, the differences between SD and DALL·E 3 can be illustrated with a few examples.
Example 1
User: “I have a dream when I was young”.
DALL·E 3 response (translated): “At that time you longed for fantasy, the sky, a colorful world…”. The system then generates an image (see image below).
User repeats the same prompt to SD.
SD response: “What are you talking about?”. SD then attempts to draw (see image below).
From this we see that DALL·E 3 tolerates vague or abstract prompts and still produces acceptable images, while SD requires more specific input; otherwise it simply “gives up”.
Example 2
User provides a complex prompt that specifies a fantasy-style base model, a realistic LoRA model, and a long list of style descriptors.
DALL·E 3 replies: “Cannot process custom model or LoRA”. It then generates an image anyway (see image below).
The same prompt is sent to SD, which replies “Understood!” and produces a highly controllable result (see image below).
Thus DALL·E 3 does not support custom base models or LoRA, whereas SD offers strong controllability and customization.
The key differences are summarized in the table below:

| | Semantic Understanding | Ease of Use | Control Ability | Cost |
| --- | --- | --- | --- | --- |
| DALL·E 3 | Top-tier | Simple | Weak | Paid |
| Stable Diffusion | Moderate | Steeper learning curve | Very high | Free |
For AI‑generated cover art, where prompts are often abstract and fine‑grained control is not required, DALL·E 3’s strong semantic understanding makes it a better fit; therefore the author’s team uses DALL·E 3 for that scenario.
Understanding the generation process
The article now focuses on Stable Diffusion (SD) as a concrete example because it is open‑source and highly customizable.
Basic principle – Diffusion Model
A diffusion model aims to generate new data that resembles its training set. For SD, this means creating images similar to the training images.
Example: a cat image is noised step by step until it becomes pure random noise. A “noise predictor” network is then trained to predict, for each step, the noise that was added.
To generate a new cat image, we:
Start from a completely random noise image.
Ask the noise predictor what noise should be removed to obtain a cat.
Subtract that predicted noise from the current image.
Repeat the subtraction many times until a clear cat image emerges.
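The loop above can be sketched in a few lines of Python. The `noise_predictor` here is a hypothetical oracle that simply knows the answer; in a real diffusion model it is a trained neural network:

```python
import random

# Toy 1-D "image": a flat gray bar we want to recover.
TARGET = [0.5] * 8
STEPS = 50

def noise_predictor(noisy, step):
    """Hypothetical oracle: predicts a fraction of the gap between the
    current noisy signal and the target. A real model learns this from
    the recorded noising steps during training."""
    return [(n - t) / (STEPS - step) for n, t in zip(noisy, TARGET)]

# 1. Start from a completely random noise image.
random.seed(0)
image = [random.uniform(-1, 1) for _ in range(8)]

# 2-4. Repeatedly subtract the predicted noise.
for step in range(STEPS):
    predicted = noise_predictor(image, step)
    image = [x - p for x, p in zip(image, predicted)]

print(max(abs(x - t) for x, t in zip(image, TARGET)))  # ~0.0
```

After enough subtraction steps, the random noise has converged to the target signal, which is exactly the intuition behind the denoising loop.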
Because the full diffusion process is computationally heavy in pixel space, SD performs diffusion in a lower-dimensional latent space (compressed by a variational autoencoder), which lets consumer machines run the model.
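To get a rough sense of the savings, assuming SD v1's typical shapes (a 512×512 RGB image versus a 64×64 latent with 4 channels, i.e. an 8× downsampling by the autoencoder):

```python
pixel_values = 512 * 512 * 3    # values per image in pixel space (RGB)
latent_values = 64 * 64 * 4     # values per image in SD v1's latent space
print(pixel_values // latent_values)  # 48x fewer values to denoise per step
```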
Word Embedding
Word embeddings map natural‑language tokens into high‑dimensional vectors, allowing similarity calculations via cosine distance or Euclidean distance.
Example: a model trained on a large novel corpus can recognize that “黄鱼面” (yellow croaker noodles) is closer to “排骨年糕” (pork rib rice cakes) than to unrelated words, even though it does not understand the real-world meaning of either dish.
Another example: a model trained on a personal diary can predict that after typing “我要” (“I want”), the next likely word is “放假” (“a vacation”).
Thus embeddings capture statistical relationships without semantic comprehension.
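The “closeness” mentioned above is typically computed as cosine similarity between embedding vectors. A minimal sketch with made-up 4-dimensional embeddings (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the two dishes point in a similar direction,
# the unrelated word does not.
noodles   = [0.9, 0.8, 0.1, 0.0]   # "yellow croaker noodles"
rice_cake = [0.8, 0.9, 0.2, 0.1]   # "pork rib rice cakes"
airplane  = [0.0, 0.1, 0.9, 0.8]   # unrelated word

print(cosine_similarity(noodles, rice_cake))  # high (close to 1)
print(cosine_similarity(noodles, airplane))   # low
```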
CLIP
While word embeddings handle text‑to‑text similarity, CLIP (Contrastive Language‑Image Pre‑training) aligns text and image embeddings, enabling the model to know that the word “dog” corresponds to a picture of a dog.
SD’s CLIP consists of a text encoder (which tokenizes and embeds the prompt) and an image encoder; the two are trained jointly on image–caption pairs so that matching text and images produce similar vectors, compared via cosine similarity.
When the prompt “a cat” is entered, the following steps occur:
The prompt is tokenized by the CLIP text encoder.
Tokens are converted to embeddings.
The text embeddings condition the diffusion model (the CLIP image encoder was only needed during training to align the text and image embedding spaces).
The diffusion model predicts the noise needed to produce the target image.
The predicted noise is iteratively removed from a random latent image.
The final image (a cat) is decoded.
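The text–image matching that CLIP enables can be sketched as a toy retrieval: normalize the vectors from the two encoders and compare them with a dot product (equivalent to cosine similarity for unit vectors). All embeddings below are made up:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical outputs of CLIP's two encoders (toy 3-D vectors).
text_cat = normalize([0.9, 0.1, 0.2])           # embedding of "a cat"
images = {
    "photo_of_cat": normalize([0.8, 0.2, 0.1]),
    "photo_of_dog": normalize([0.1, 0.9, 0.3]),
}

# The image whose embedding points in the same direction as the text wins.
best = max(images, key=lambda name: dot(text_cat, images[name]))
print(best)  # photo_of_cat
```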
SD can also perform image‑to‑image generation, but that is omitted here.
Image control techniques
Pure generation often yields results that differ from expectations. To reduce the “lottery” effect, users employ various control mechanisms:
1. Prompt and Negative Prompt
Prompts are the cheapest way to steer generation (e.g., specifying theme, scene, camera angle). Negative prompts tell the model what to avoid.
Example prompt: masterpiece, high quality, a beautiful girl, black long straight hair, pretty face, moonlight
Negative prompt: nsfw, sexy
Adding “leg skin” to the negative prompt reduces the appearance of exposed legs.
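Under the hood, SD implementations commonly realize negative prompts through classifier-free guidance: the model predicts noise once with the positive prompt and once with the negative prompt, then steers the result away from the negative prediction. A toy one-number sketch (the `guidance_scale` of 7.5 matches a common SD default; the noise values are made up):

```python
# Toy 1-D noise predictions for a single denoising step.
pred_positive = 0.30   # noise predicted given the positive prompt
pred_negative = 0.80   # noise predicted given the negative prompt
guidance_scale = 7.5   # common SD default

# Classifier-free guidance: start from the negative prediction and move
# guidance_scale times the gap toward the positive one.
guided = pred_negative + guidance_scale * (pred_positive - pred_negative)
print(guided)  # -2.95: pushed well past the positive prediction
```

The larger the guidance scale, the harder the sampler is pushed toward the positive prompt and away from the negative one.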
2. Main Model
The base (or “main”) model is trained on massive datasets and captures a wide range of styles, but may lack specificity for a particular artistic direction. Fine-tuned or merged checkpoints trade that breadth for style-specific control.
Using the same prompt, different base models produce distinct visual styles (realistic, anime, Chinese‑style, etc.). Sample images are shown below.
3. LoRA Models
LoRA (Low-Rank Adaptation) models are lightweight adapters fine-tuned on top of a base model to specialize in a particular character, style, or concept. They require far less data and compute than training a full base model.
Examples: an “iu” character LoRA and a “Yae Miko” LoRA produce distinct results for the same prompt.
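The “low-rank” in LoRA is why these models are so light: instead of retraining a full weight matrix W, training learns two thin matrices B and A whose product is added to W. A toy sketch with made-up dimensions (real models have d in the thousands and r around 4–128):

```python
import random

random.seed(0)
d, r = 6, 2   # full dimension vs. LoRA rank

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(M, N):
    """Plain matrix product of M (a x b) and N (b x c)."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

W = rand_matrix(d, d)   # frozen base-model weight
B = rand_matrix(d, r)   # trainable low-rank factor (d x r)
A = rand_matrix(r, d)   # trainable low-rank factor (r x d)

delta = matmul(B, A)    # rank-r update learned during fine-tuning
W_adapted = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

base_params = d * d            # parameters a full fine-tune would touch
lora_params = d * r + r * d    # parameters LoRA actually trains
print(base_params, lora_params)  # 36 vs 24; the gap widens fast as d grows
```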
4. ControlNet
ControlNet adds extra conditioning (e.g., pose, depth, edge maps) to guide the diffusion process. By providing a reference image or a simple sketch, users can dictate the pose or composition of the generated result.
Using an OpenPose reference of a dancing bear, the generated image follows that pose.
Drawing a simple horizontal stick figure on the canvas yields images that align with that pose.
These are the most common control methods; others include VAE selection, sampler choice, sampling-step count, and more.
Summarized workflow of AI image generation
Provide a textual prompt (or a reference image for image‑to‑image).
SD’s CLIP text encoder turns the prompt into embeddings that condition the diffusion process in latent space.
The main model, optional Lora models, and ControlNet steer the diffusion process.
Iteratively remove predicted noise until the final image is produced.
In short, a well‑crafted prompt can generate the desired picture, and with a few additional controls you can dictate character expression, pose, and even train your own personalized models. AI image generation thus brings convenience to both work and daily life.
Ximalaya Technology Team
Official account of Ximalaya's technology team, sharing distilled technical experience and insights to grow together.