ByteDance Teams with He Kaiming to Open‑Source the Continuous Diffusion Language Model Cola DLM
The article analyzes ByteDance's Cola DLM, a fully open‑source continuous diffusion language model that abandons token‑centric generation in favor of latent semantic representations, detailing its architecture, training strategy, scaling stability, and how it compares with the earlier ELF model.
Motivation: beyond token prediction
Both ByteDance and He Kaiming argue that large language models need not predict the next token; the focus should be on continuous semantic representations.
The motivation behind Cola DLM is representation, not diffusion: diffusion is the means, and a better semantic representation is the goal.
Tokens are only surface carriers of meaning. The same meaning can be phrased in different ways, such as “我今天很开心” (“I am very happy today”), “今天我心情很好” (“I'm in a good mood today”), and “今天过得挺愉快” (“Today has been quite pleasant”). Traditional models treat each phrasing as a separate pattern, whereas Cola DLM aims for a stable, abstract semantic state that unifies them.
Core architecture
Cola DLM consists of two components:
Latent prior that generates a continuous semantic latent variable.
Decoder that translates the latent into concrete text.
The diffusion/flow‑matching process operates entirely in latent space, transporting a simple distribution (e.g., Gaussian) into the learned semantic latent distribution without step‑by‑step token generation.
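As a rough illustration of the transport idea described above, the sketch below integrates a velocity field with fixed Euler steps, moving a Gaussian noise sample toward a target latent. Everything in it is assumed for illustration: `euler_transport`, `toy_velocity`, and the single target point `TARGET` are hypothetical names, the velocity field is the closed form for a straight-line path to one point rather than a learned network, and the article does not specify Cola DLM's actual sampler.

```python
import random

# Hypothetical target latent. A real prior models a whole distribution of
# latents, not a single point; one point keeps the toy example exact.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    """Closed-form velocity for a straight-line path toward TARGET.

    Stand-in for a learned vector field v(x, t); valid for 0 <= t < 1.
    """
    return [(target_i - x_i) / (1.0 - t) for x_i, target_i in zip(x, TARGET)]


def euler_transport(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]
    return x


random.seed(0)
# Sample from the simple source distribution (standard Gaussian noise).
noise = [random.gauss(0.0, 1.0) for _ in TARGET]
latent = euler_transport(toy_velocity, noise)
print(latent)
```

The whole latent vector is updated at every step, so no token is ever emitted during this stage; turning the final latent into text would be the decoder's job.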
Key design details
Latent is not a simple embedding. Cola DLM uses a Text VAE: an encoder compresses discrete text into a continuous “semantic fingerprint”, and a decoder reconstructs text from this latent. Unlike token embeddings tied to individual tokens, the latent variable can vary continuously and be modeled probabilistically.
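To make the contrast with a fixed token embedding concrete, here is a minimal sketch of the textbook VAE sampling step, assuming a diagonal Gaussian posterior. The article does not describe the form of Cola DLM's posterior, so the functions `reparameterize` and `kl_to_standard_normal` and the example numbers are hypothetical. The KL term is measured against a standard normal only because that is the textbook case; Cola DLM learns its prior separately.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Sample z = mean + std * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, log_var))


# Hypothetical encoder output for one sentence (4-dimensional latent).
mean = [0.3, -0.2, 1.1, 0.0]
log_var = [-1.0, -0.5, 0.0, -2.0]

rng = random.Random(0)
z1 = reparameterize(mean, log_var, rng)
z2 = reparameterize(mean, log_var, rng)
print(z1)
print(z2)
print(kl_to_standard_normal(mean, log_var))
```

The two draws `z1` and `z2` differ even though they come from the same input, which is the sense in which the latent is a probabilistic, continuously varying quantity rather than a lookup-table vector.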
Prior uses block‑causal DiT + Flow Matching. Instead of the classic “add noise → denoise” diffusion, the prior learns a vector field that directly transports a simple distribution into the real latent distribution. The block‑causal architecture handles local semantics in parallel while preserving global causal order.
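The two ingredients named above can be sketched in a few lines. This is an illustration under assumptions, not the released implementation: it assumes attention is unrestricted inside a block and causal across blocks, and it uses the common straight-line (linear interpolation) flow-matching path, which the article does not confirm. The names `block_causal_mask` and `flow_matching_pair` are hypothetical.

```python
def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when position i may attend to position j.

    A position sees every position in its own block and in all earlier
    blocks, but nothing in later blocks.
    """
    return [[(j // block_size) <= (i // block_size) for j in range(seq_len)]
            for i in range(seq_len)]


def flow_matching_pair(x0, x1, t):
    """Return (x_t, target velocity) on the straight path from x0 to x1.

    x0 is a noise sample, x1 a data latent, and t lies in [0, 1]. A network
    would be trained to predict the target velocity from (x_t, t).
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


for row in block_causal_mask(seq_len=4, block_size=2):
    print("".join("x" if allowed else "." for allowed in row))

x_t, v = flow_matching_pair(x0=[0.0, 0.0], x1=[1.0, -2.0], t=0.25)
print(x_t, v)
```

With `seq_len=4` and `block_size=2`, the first two positions see only each other, while the last two see all four. This is one way to read "local semantics in parallel, global causal order".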
Clear separation of training roles. Training splits the work in two: the encoder and decoder focus solely on converting between text and the semantic latent, while the prior (DiT + Flow Matching) learns to generate the latent from noise. During the diffusion (prior) training stage the encoder is frozen, and a BERT‑style masked‑token loss guards against semantic collapse.
Three diagnostic sub‑objectives. The overall loss is decomposed into:
Reconstruction ability – can the decoder recover the original text from the latent?
Compression ability – how much information does the latent retain?
Fitting ability – can the prior accurately model the true latent distribution?
This decomposition lets researchers pinpoint failures (e.g., poor reconstruction vs. weak prior) rather than blaming a monolithic “next‑token” loss.
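One standard way to write a three-part decomposition of this kind is the negative evidence lower bound of a VAE with a learned prior. The notation is assumed rather than taken from the paper: $q_\phi$ is the encoder, $p_\theta$ the decoder, and $p_\psi$ the prior. The mapping of each term to a named sub-objective is an interpretation, and the paper's exact form may differ.

```latex
\begin{aligned}
-\log p(x) \;\le\;
  & \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x \mid z)\big]}_{\text{reconstruction}} \\
  & + \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x)\big]}_{\text{compression (negative posterior entropy)}} \\
  & + \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\psi(z)\big]}_{\text{prior fit}}
\end{aligned}
```

The last two terms sum to $\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\psi(z)\big)$, so tracking them separately shows whether a poor bound comes from an uninformative latent or from a prior that fails to match the encoder's latents.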
Empirical findings
In strict scaling experiments (~2 B parameters, ~2000 EFLOPs), Cola DLM shows a more stable scaling trend than autoregressive models and mainstream discrete diffusion language models.
Compared with the earlier ELF model (a 105 M‑parameter model that bypasses token space), Cola DLM separates semantic generation from textual rendering, offering a modular approach that aligns better with multimodal generation pipelines.
Broader implications
Moving generation to a continuous semantic latent space enables tighter integration with image, video, and audio modalities, which are naturally continuous.
Cola DLM is not merely another competitor in the diffusion‑language‑model race; it is a bridge that connects text to a continuous multimodal world.
Key comparisons with ELF
Both ELF and Cola DLM challenge the assumption that language models must be built on discrete tokens. ELF performs diffusion directly in the original embedding space, while Cola DLM splits the pipeline into a semantic prior and a separate decoder, allowing parallel local semantic organization (block‑causal) and global causal consistency.
Resources
HuggingFace model hub: https://huggingface.co/ByteDance-Seed/Cola-DLM
GitHub repository: https://github.com/ByteDance-Seed/Cola-DLM
Paper (arXiv): https://arxiv.org/abs/2605.06548
Blog post: https://hongcanguo.github.io/posts/2026-cola-dlm-zh.html
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
