
Overview of Text‑Controlled Image Generation Models: DALL‑E‑2, Imagen, Latent Stable Diffusion, and ControlNet

This article surveys the key challenges of controllable text‑to‑image generation and explains the architectures, components, and training details of major diffusion‑based models such as DALL‑E‑2, Google Imagen, Stability AI's Latent Stable Diffusion, and the ControlNet extension.

DataFunTalk

Introduction: The article reviews challenges of controllable image generation in industry and introduces representative text‑controlled diffusion models such as OpenAI’s DALL‑E‑2, Google’s Imagen, Stability AI’s Latent Stable Diffusion, and Stanford’s ControlNet.

DALL‑E‑2: A diffusion pipeline with four components—a CLIP text encoder, a prior that converts text embeddings into image embeddings, a classifier‑free guided diffusion decoder, and super‑resolution—each trained separately. The text encoder section includes a Python pseudo‑code implementation of CLIP contrastive training:

def clip_training(imgs, texts):
    # The batch should be as large as possible: more in-batch negatives
    # make the contrastive objective stronger.
    # The image and text encoders map their inputs to embeddings of the
    # same dimension so they can be compared directly.
    img_embedding = img_encoder(imgs)
    txt_embedding = text_encoder(texts)
    norm_img_embedding = tf.nn.l2_normalize(img_embedding, -1)
    norm_txt_embedding = tf.nn.l2_normalize(txt_embedding, -1)
    # Pairwise cosine similarities: entry (i, j) scores text i against image j.
    logits = tf.matmul(norm_txt_embedding, norm_img_embedding, transpose_b=True)
    batch_size = tf.shape(imgs)[0]
    labels = tf.range(batch_size)
    # Simplified (symmetric) InfoNCE: for row i, the matching pair on the
    # diagonal is the positive class; all other entries are negatives.
    loss_txt = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    # Transpose the logits to score each image against all texts.
    loss_img = tf.keras.losses.sparse_categorical_crossentropy(labels, tf.transpose(logits), from_logits=True)
    return (loss_txt + loss_img) / 2.0

Prior module: Explains how the textual condition y is transformed into an image embedding Z_i via a diffusion‑based prior, with loss formulation derived from likelihood p(x|y) and sampling acceleration using DDIM.
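The DDIM acceleration mentioned above can be sketched as a single deterministic update step. This is an illustrative NumPy sketch, not code from the paper: `eps_pred` stands in for the model's noise prediction, and `alpha_bar_t` / `alpha_bar_prev` are the cumulative noise-schedule products at the current and previous (possibly skipped-to) timesteps.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0)."""
    # Predict the clean sample x0 from the current noisy sample and the
    # model's noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise x0 to the earlier (less noisy) timestep without adding fresh
    # stochastic noise, which is what lets DDIM skip timesteps safely.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

Because the update is deterministic, chaining it over a short subsequence of timesteps yields far fewer network evaluations than full ancestral sampling.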

Decoder and SR modules: The decoder is a classifier‑free guided diffusion model that generates 64×64 images from Z_i. Super‑resolution (SR) modules then upsample the 64×64 outputs to 256×256 and finally to 1024×1024 using diffusion‑based SR techniques.
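The classifier‑free guidance used by the decoder combines a conditional and an unconditional noise prediction at sampling time. A minimal sketch (the function name and arguments are illustrative, not from the DALL‑E‑2 code):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one. A scale of 1.0 recovers plain
    # conditional sampling; larger scales strengthen adherence to the text.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At training time the same network is occasionally fed an empty condition, so both `eps_uncond` and `eps_cond` come from one model with the condition dropped or kept.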

Imagen: Uses a T5‑XXL text encoder and a text‑to‑image diffusion model (equivalent to DALL‑E‑2’s decoder), followed by two cascaded diffusion‑based super‑resolution stages that also condition on the text embedding. To keep the diffusion outputs within a stable pixel range, the model applies thresholding—either static clipping or a proportional (dynamic, percentile‑based) variant—with the proportional form yielding better visual detail.
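The two truncation strategies can be contrasted in a few lines. This is a sketch of the idea rather than Imagen's implementation; the percentile value is a tunable assumption:

```python
import numpy as np

def static_threshold(x0):
    # Hard clip the predicted clean sample to the valid pixel range.
    return np.clip(x0, -1.0, 1.0)

def dynamic_threshold(x0, percentile=99.5):
    # Proportional truncation: choose a per-sample threshold s at a high
    # percentile of |x0|. If s exceeds 1, clip to [-s, s] and divide by s,
    # pulling saturated pixels back proportionally instead of flattening
    # them at the boundary, which preserves more detail.
    s = max(np.percentile(np.abs(x0), percentile), 1.0)
    return np.clip(x0, -s, s) / s
```

With static clipping, many out‑of‑range values collapse onto exactly ±1 (visible as washed‑out, saturated regions); the proportional rescaling keeps their relative ordering.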

Latent Stable Diffusion (LDM): Operates in a latent space learned by an auto‑encoder, drastically reducing computation. It employs cross‑attention to fuse conditioning signals and consists of three parts: a latent encoder‑decoder, a conditioning module that can accept arbitrary inputs (images, sketches, text), and a diffusion model that predicts noise in latent space and reconstructs images via the decoder.
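The cross‑attention fusion can be sketched as follows: the flattened latent feature map supplies the queries, while the conditioning embeddings (e.g. text tokens) supply keys and values. Shapes and weight names here are illustrative, not from the LDM code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, cond, Wq, Wk, Wv):
    # z:    (n_latent, d)   flattened latent feature map -> queries
    # cond: (n_tokens, d_c) conditioning embeddings      -> keys and values
    q = z @ Wq
    k = cond @ Wk
    v = cond @ Wv
    # Each latent position attends over all conditioning tokens.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v
```

Because keys and values come from the condition and queries from the latent, the same mechanism accepts any modality that can be embedded as a token sequence, which is what makes the conditioning module input‑agnostic.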

ControlNet: Extends LDM by adding trainable zero‑conv branches that accept additional control conditions (edges, sketches, poses, etc.) while keeping the original LDM frozen. This plugin‑style architecture enables fine‑grained control over generated images without retraining the whole diffusion model.
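The zero‑convolution trick can be sketched in a toy form. This is an illustrative reduction (a 1×1 conv modeled as a zero‑initialized matrix), not the ControlNet implementation:

```python
import numpy as np

class ZeroConv:
    # 1x1 convolution whose weights start at zero: at the beginning of
    # training the control branch contributes nothing, so the frozen LDM
    # behaves exactly as it did before the plugin was attached.
    def __init__(self, channels):
        self.w = np.zeros((channels, channels))

    def __call__(self, x):
        return x @ self.w

def controlnet_block(frozen_block, trainable_copy, zero_conv, h, control):
    # The frozen LDM block runs unchanged; a trainable copy processes the
    # features plus the control signal, and its output is merged back in
    # through the zero conv.
    return frozen_block(h) + zero_conv(trainable_copy(h + control))
```

As training proceeds, the zero‑conv weights grow away from zero and the control signal gradually steers generation, while the frozen branch guarantees the base model's behavior is never destroyed.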

References: Provides URLs to the original papers, blogs, and code repositories for DALL‑E‑2, Imagen, Stable Diffusion, CLIP, and related resources.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
