Understanding OpenAI's Sora Video Generation Model: Diffusion, Transformers, and Latent Space
OpenAI's Sora video generation model is built on latent diffusion: a video compression encoder‑decoder maps videos into a latent space, the latents are tokenized into spatio‑temporal patches, a diffusion‑trained Transformer conditioned on DALL·E‑style text annotations denoises those patches, and a decoder reconstructs high‑resolution videos up to a minute long.
OpenAI released the video‑generation model Sora, which can synthesize up to one‑minute high‑quality videos from text prompts at arbitrary resolutions (e.g., 1920×1080 or 1080×1920). The model far exceeds the capabilities of earlier video generators such as Stable Video Diffusion, which was limited to 25‑frame, low‑resolution outputs.
A short technical report accompanying the launch outlines Sora’s architecture and use cases, but does not dive deeply into the underlying mechanisms. The author collected various sources and distilled the core workflow into several steps.
Massive video datasets of varying resolutions and durations are collected and projected into a latent space through dimensionality reduction, where they are annotated with text.
DALL·E 3‑style re‑annotation enriches the textual descriptions, providing more detailed conditioning information.
A video‑compression network maps raw videos to low‑dimensional latent representations; at generation time, the process starts from random noise in this latent space rather than in pixel space.
The latent data are split into spatio‑temporal patches, which are linearised into token sequences.
These tokens are fed to a diffusion‑trained Transformer (DiT), whose attention mechanism attends to key prompt words while the model iteratively denoises the tokens.
The denoised latent tokens are decoded back into pixel‑space video, supporting multiple output resolutions.
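The generation steps above can be sketched end to end in a few lines. Everything here is illustrative: the latent shape, the patch size, and the trivial `denoise_step` are stand‑ins for Sora's actual (unpublished) compression network, tokenizer, and DiT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative latent shape: 8 latent frames of 16x16 with 4 channels.
T, H, W, C = 8, 16, 16, 4
P = 4  # spatial patch size (an assumption, not from the report)

def patchify(latent):
    """Split a (T, H, W, C) latent video into flattened spatio-temporal tokens."""
    t, h, w, c = latent.shape
    tokens = latent.reshape(t, h // P, P, w // P, P, c)
    tokens = tokens.transpose(0, 1, 3, 2, 4, 5).reshape(-1, P * P * c)
    return tokens

def denoise_step(tokens, step, total):
    """Stand-in for one DiT denoising step: shrink the noise a little each time."""
    return tokens * (1.0 - 1.0 / (total - step + 1))

# "Generation": start from pure noise in latent space and denoise iteratively.
latent_noise = rng.standard_normal((T, H, W, C))
tokens = patchify(latent_noise)
scale_before = float(np.abs(tokens).mean())
for step in range(10):
    tokens = denoise_step(tokens, step, 10)
scale_after = float(np.abs(tokens).mean())
```

After the loop, the denoised tokens would be reassembled and passed to the visual decoder, which is the step the article describes next.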
If readers only need a high‑level view, they can stop here; otherwise the article proceeds to detailed explanations of each component.
The text‑to‑image part of the discussion revisits Stable Diffusion, which is built on a Latent Diffusion Model (LDM). LDM first trains an auto‑encoder (encoder ε and decoder δ) to compress images into a latent space, then applies a diffusion model (typically a UNet) in that space. Adding a Transformer to the diffusion pipeline enables precise prompt control.
Latent space is an abstract, low‑dimensional representation that captures essential data features. It provides abstraction, dimensionality reduction, denoising, generative capabilities, and supports interpolation and manipulation.
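A toy way to see what "compressing into a latent space" means is a linear auto‑encoder learned via SVD (i.e., PCA). The data, dimensions, and rank below are all made up for illustration; a real LDM auto‑encoder is a learned convolutional network, not a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 200 samples of 64-dim data that really live on an 8-dim subspace.
basis = rng.standard_normal((8, 64))
data = rng.standard_normal((200, 8)) @ basis

# A linear auto-encoder learned via SVD (PCA): keep the top 8 components.
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
encoder = vt[:8].T          # 64 -> 8: projection into the "latent space"
decoder = vt[:8]            # 8 -> 64: reconstruction back to data space

latents = (data - mean) @ encoder
recon = latents @ decoder + mean
reconstruction_error = float(np.abs(recon - data).max())
```

Because the toy data truly has 8 degrees of freedom, the 8‑dim latents reconstruct it almost exactly, which is the essence of "dimensionality reduction that keeps the essential features."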
The Stable Diffusion pipeline consists of:
Text encoding via OpenCLIP to obtain a semantic vector.
A diffusion model that gradually transforms random noise into an image conditioned on the text vector.
A final upscaler diffusion model that enhances resolution.
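The diffusion step in the middle of this pipeline follows the standard DDPM formulation: the forward process gradually mixes the clean signal with Gaussian noise, and generation learns to reverse it. The schedule values below are the common illustrative choices, not Stable Diffusion's exact hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear noise schedule over 1000 steps (values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

x0 = rng.standard_normal(64)   # stand-in for a clean latent "image"
eps = rng.standard_normal(64)  # Gaussian noise

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x_early = q_sample(x0, 10, eps)    # still close to the clean signal
x_late = q_sample(x0, T - 1, eps)  # almost pure noise

early_dist = float(np.linalg.norm(x_early - x0))
late_dist = float(np.linalg.norm(x_late - x0))
```

Generation runs this in reverse: starting from `x_late`‑like noise and, conditioned on the text vector, stepping back toward an `x0`.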
The article then explains the Transformer architecture (self‑attention, multi‑head attention, position encoding) and its historical impact on sequence modeling.
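The self‑attention mechanism at the heart of the Transformer is compact enough to write out directly. This is a single‑head sketch with random weights; real models use learned parameters, multiple heads, and position encodings on top of this.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))  # each row sums to 1
    return weights @ v, weights

n_tokens, d_model = 6, 16
x = rng.standard_normal((n_tokens, d_model))
wq, wk, wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
```

Each output token is a weighted mixture of all input tokens, which is what lets the DiT relate a patch to every other patch (and to the prompt tokens) in one step.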
Transitioning to video, the author asks how to extend these ideas. Directly applying Stable Diffusion to each video frame independently leads to flickering; VideoLDM addresses this by adding temporal layers to a pretrained image LDM and fine‑tuning the decoder for temporal consistency.
Sora’s video‑compression network is a purpose‑built encoder‑decoder that compresses both spatial and temporal dimensions, producing a latent video that serves as the DiT’s input.
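Sora's actual compression network is not public; a crude stand‑in that conveys the idea is joint spatio‑temporal downsampling. The pooling factors below are arbitrary assumptions, chosen only to show that both the time axis and the spatial axes shrink.

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(video, ft=2, fs=4):
    """Toy stand-in for a video-compression network: average-pool by a factor
    of `ft` in time and `fs` in each spatial dimension (factors are assumptions)."""
    t, h, w, c = video.shape
    v = video.reshape(t // ft, ft, h // fs, fs, w // fs, fs, c)
    return v.mean(axis=(1, 3, 5))

video = rng.standard_normal((16, 64, 64, 3))  # (frames, height, width, channels)
latent = compress(video)
```

A learned encoder would replace the averaging with trained convolutions and add a channel‑rich feature dimension, but the input/output shape relationship is the same.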
Compressed latents are divided into spatio‑temporal patches, which are then linearised into tokens. Position encodings (e.g., (x, y, t)) inform the Transformer of each patch’s location.
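The patch‑plus‑position step can be made concrete as follows. The patch size and latent shape are illustrative; the point is that each token carries both its content and an (x, y, t) grid coordinate for the position encoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify_with_positions(latent, p=4):
    """Cut a (T, H, W, C) latent into p x p spatial patches per frame and
    record each patch's (x, y, t) grid coordinates."""
    t, h, w, c = latent.shape
    tokens, positions = [], []
    for ti in range(t):
        for yi in range(h // p):
            for xi in range(w // p):
                patch = latent[ti, yi * p:(yi + 1) * p, xi * p:(xi + 1) * p]
                tokens.append(patch.reshape(-1))
                positions.append((xi, yi, ti))
    return np.stack(tokens), positions

latent = rng.standard_normal((4, 16, 16, 8))
tokens, positions = patchify_with_positions(latent)
```

Because the token count is just (T/1) × (H/p) × (W/p), the same scheme handles any resolution or duration, which is how Sora can train on and generate variable‑sized videos.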
The Diffusion Transformer (DiT) processes these tokens, performing denoising conditioned on the text prompt. After all patches are denoised, a visual decoder reconstructs the full‑resolution video.
The training pipeline includes:
Collecting videos and their textual annotations.
Pre‑processing videos (resolution adjustment, format conversion, length trimming).
Using a DALL·E 3‑style model to generate high‑quality descriptive captions for each video.
Training the diffusion model to predict clean patches from noisy ones.
Generating videos by initializing latent patches from noise and iteratively denoising them.
Decoding the final latent into pixel‑space video and applying optional post‑processing.
References are provided to OpenAI’s research page, a Zhihu article, and several WeChat posts.
DeWu Technology