Understanding Stable Diffusion Architecture and Implementing It with the Diffusers Library
This article reviews the evolution from GANs to diffusion models, explains the components of Stable Diffusion—including the CLIP text encoder, VAE, and UNet—and provides step‑by‑step Python code using HuggingFace's Diffusers library to generate images from text prompts.
1. Introduction
In the history of AI‑generated art, GANs were the first breakthrough, but recent progress has been dominated by diffusion models (DMs). Stable Diffusion, the most popular open‑source diffusion model, powers many community projects such as WebUI, ComfyUI, Fooocus, Civitai, and the HuggingFace Diffusers library.
2. Network Structure
Stable Diffusion consists of three main sub‑networks: a CLIP‑based text encoder, a UNet noise‑prediction model, and a VAE for latent‑space compression and decoding.
2.1 Overall Architecture
The generation pipeline follows these steps:
1. Encode the input text with CLIP to obtain a text embedding.
2. Sample a random latent tensor from a normal distribution.
3. Feed the latent and text embedding into the UNet to predict noise.
4. Use the scheduler to remove the predicted noise from the latent.
5. Repeat steps 3-4 for many denoising iterations.
6. Decode the final latent with the VAE decoder to produce the image.
2.2 Text Encoder
Stable Diffusion uses OpenAI's CLIP model rather than a generic BERT encoder because CLIP aligns image and text representations, enabling more faithful text‑conditioned generation.
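The shapes involved can be illustrated without downloading the real checkpoint. The sketch below builds a tiny, randomly initialized CLIP text encoder (the small hidden size, layer count, and vocabulary here are arbitrary toy values; Stable Diffusion v1 actually uses 77 tokens with hidden size 768). It only shows that the encoder turns a token sequence into one embedding vector per token:

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Toy CLIP text encoder with random weights; config values are illustrative only.
config = CLIPTextConfig(
    hidden_size=64, intermediate_size=128,
    num_attention_heads=4, num_hidden_layers=2,
    max_position_embeddings=77, vocab_size=1000,
)
text_encoder = CLIPTextModel(config)

input_ids = torch.randint(0, 1000, (1, 77))  # stand-in for a tokenized prompt
with torch.no_grad():
    emb = text_encoder(input_ids)[0]  # last_hidden_state
print(emb.shape)  # (1, 77, 64): one embedding vector per token
```

This per-token embedding sequence is what the UNet later consumes as `encoder_hidden_states` in its cross-attention layers.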
2.3 VAE Model
The VAE compresses images into a lower‑dimensional latent space, reducing computational cost. Its encoder maps an image x to a mean μ and standard deviation σ, from which a latent z is sampled; its decoder reconstructs the image from z.
2.4 UNet Model
The UNet acts as a noise‑prediction network. During generation it repeatedly receives a noisy latent and the text embedding, predicts the noise present, and the scheduler removes that noise, gradually denoising the latent until a clean latent remains for the VAE to decode.
3. Diffusers Module
The HuggingFace diffusers library provides ready‑made pipelines and low‑level components to implement the above steps.
3.1 Using a Pipeline
A simple one‑liner can generate an image from a text prompt:
from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
image = pipeline("stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k").images[0]
image

3.2 Loading Individual Components
For finer control, each sub‑module can be loaded separately:
from tqdm.auto import tqdm
from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
model_path = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_path, subfolder="vae")
tokenizer = CLIPTokenizer.from_pretrained(model_path, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_path, subfolder="scheduler")
torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

3.3 Encoding the Prompt
Tokenize the prompt and obtain text embeddings:
prompt = ["a photograph of an astronaut riding a horse"]
height = 512
width = 512
num_inference_steps = 25
guidance_scale = 7.5
batch_size = len(prompt)
text_input = tokenizer(
prompt, padding="max_length", max_length=tokenizer.model_max_length,
truncation=True, return_tensors="pt"
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
For classifier-free guidance, also encode an empty ("unconditional") prompt and concatenate both embeddings, so the conditional and unconditional predictions can be computed in a single UNet forward pass:
uncond_input = tokenizer(
    [""] * batch_size, padding="max_length",
    max_length=tokenizer.model_max_length, return_tensors="pt"
)
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

3.4 Getting the Latent Variable
Sample random noise and scale it according to the scheduler:
latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    device=torch_device
)
latents = latents * scheduler.init_noise_sigma

3.5 Denoising Loop
Iteratively run the UNet, combine the two noise predictions using guidance_scale, and let the scheduler update the latent:
scheduler.set_timesteps(num_inference_steps)
for t in tqdm(scheduler.timesteps):
    # Duplicate the latents: one copy for the unconditional pass, one for the text-conditioned pass.
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)
    with torch.no_grad():
        noise_pred = unet(
            latent_model_input, t, encoder_hidden_states=text_embeddings
        ).sample
    # Classifier-free guidance: steer the prediction toward the text-conditioned direction.
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    latents = scheduler.step(noise_pred, t, latents).prev_sample
Transform the denoised latent back to pixel space:
latents = 1 / 0.18215 * latents  # undo the SD v1 latent scaling factor
with torch.no_grad():
    image = vae.decode(latents).sample
image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image.show()

4. Conclusion
The tutorial traced AI‑painting from early GANs to modern diffusion models, detailed the internal architecture of Stable Diffusion, and demonstrated a complete end‑to‑end implementation using the Diffusers library. Future extensions may include LoRA, ControlNet, and other conditioning techniques.