ByteDance Teams with He Kaiming to Open‑Source the Continuous Diffusion Language Model Cola DLM

The article analyzes ByteDance's Cola DLM, a fully open‑source continuous diffusion language model that abandons token‑centric generation in favor of latent semantic representations, detailing its architecture, training strategy, scaling stability, and how it compares with the earlier ELF model.

AIWalker

Motivation: beyond token prediction

Both ByteDance and He Kaiming argue that large language models need not predict the next token; the focus should be on continuous semantic representations.

Cola DLM’s motivation is representation, not diffusion.

Tokens are surface carriers: the same meaning can be phrased in different ways, such as “I'm very happy today” (“我今天很开心”), “I'm in a good mood today” (“今天我心情很好”), or “Today has been quite pleasant” (“今天过得挺愉快”). Traditional models treat each phrasing as a separate pattern, whereas Cola DLM aims for a stable abstract semantic state that unifies them.

Core architecture

Cola DLM consists of two components:

Latent prior that generates a continuous semantic latent variable.

Decoder that translates the latent into concrete text.

The diffusion/flow‑matching process operates entirely in latent space, transporting a simple distribution (e.g., Gaussian) into the learned semantic latent distribution without step‑by‑step token generation.
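The two-stage flow described above (transport noise into a latent, then decode it) can be sketched in a few lines. This is a toy illustration under stated assumptions, not Cola DLM's implementation: the latent is a single number, `velocity` is a closed-form field that stands in for the learned prior network, and `decode` is a hypothetical nearest-neighbour lookup that stands in for the text decoder. All function names and the `codebook` are invented for this example.

```python
import random

def velocity(z, t, mu=3.0, s=0.5):
    """Toy closed-form velocity field that moves N(0, 1) to N(mu, s^2).

    Assumption: this replaces the learned prior, which in the real model
    would be a neural network evaluated on a whole latent sequence.
    """
    var_t = (1.0 - t) ** 2 + (t * s) ** 2   # variance of z_t on the straight path
    slope = (t * s * s - (1.0 - t)) / var_t
    return mu + slope * (z - t * mu)

def sample_latent(z0, steps=1000):
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with Euler steps."""
    z, dt = z0, 1.0 / steps
    for i in range(steps):
        z += dt * velocity(z, i * dt)
    return z

def decode(z, codebook):
    """Toy decoder: return the text whose stored latent is closest to z."""
    return min(codebook, key=lambda text: abs(codebook[text] - z))

if __name__ == "__main__":
    codebook = {"I am happy today": 3.5, "It is raining": 2.5}  # hypothetical
    z0 = random.gauss(0.0, 1.0)   # draw from the simple distribution
    z1 = sample_latent(z0)        # transport into the latent distribution
    print(decode(z1, codebook))   # render the latent as text
```

Note that no token is produced until the final `decode` call; all iteration happens on the continuous value `z`.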

Key design details

Latent is not a simple embedding. Cola DLM uses a Text VAE: an encoder compresses discrete text into a continuous “semantic fingerprint”, and a decoder reconstructs text from this latent. Unlike token embeddings tied to individual tokens, the latent variable can vary continuously and be modeled probabilistically.
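To illustrate what "modeled probabilistically" can mean for a VAE latent, here is a minimal sketch assuming a diagonal Gaussian posterior, which is the textbook VAE choice. The source does not state which posterior family Cola DLM's Text VAE uses, so treat the functions below (names included) as assumptions for illustration only.

```python
import math
import random

def reparameterize(mean, log_var, rng=random):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, log_var))

if __name__ == "__main__":
    # Hypothetical encoder output for one sentence: a mean and a log-variance.
    mean, log_var = [0.2, -1.0], [-2.0, -2.0]
    print(reparameterize(mean, log_var))
    print(kl_to_standard_normal(mean, log_var))
```

Unlike a token embedding, which is a fixed row in a lookup table, the encoder here outputs a distribution, so nearby points in latent space remain valid inputs to the decoder.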

Prior uses block‑causal DiT + Flow Matching. Instead of the classic “add noise → denoise” diffusion, the prior learns a vector field that directly transports a simple distribution into the real latent distribution. The block‑causal architecture handles local semantics in parallel while preserving global causal order.
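The two ingredients named above can be sketched as follows. Both parts rest on assumptions: the mask encodes one common reading of "block-causal" (full attention inside a block, causal attention across blocks), and the training pair uses the straight-line interpolation path that is typical of flow matching. The source does not specify the exact path or mask layout, and the function names are hypothetical.

```python
def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when position i may attend to position j.

    Positions see everything in their own block (processed in parallel)
    and everything in earlier blocks (causal order across blocks).
    """
    return [[(j // block_size) <= (i // block_size) for j in range(seq_len)]
            for i in range(seq_len)]

def flow_matching_pair(noise, latent, t):
    """Return (z_t, target velocity) on the straight path from noise to latent."""
    z_t = [(1.0 - t) * n + t * x for n, x in zip(noise, latent)]
    target = [x - n for n, x in zip(noise, latent)]
    return z_t, target

def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

if __name__ == "__main__":
    for row in block_causal_mask(seq_len=4, block_size=2):
        print(["x" if allowed else "." for allowed in row])
    z_t, v = flow_matching_pair(noise=[0.0, 2.0], latent=[1.0, 4.0], t=0.5)
    print(z_t, v)
```

In training, a network would receive `z_t` and `t`, predict a velocity, and be penalized with `mse` against `target`. There is no separate noising schedule to invert, which is the contrast with "add noise → denoise" that the paragraph draws.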

Clear separation of training roles. Training splits the work in two: the encoder and decoder focus solely on converting between text and the semantic latent, while the prior (DiT + Flow Matching) learns to generate the latent from noise. While the diffusion prior is trained, the encoder is frozen, and a BERT‑style mask loss guards against semantic collapse.

Three diagnostic sub‑objectives. The overall loss is decomposed into:

Reconstruction ability – can the decoder recover the original text from the latent?

Compression ability – how much information does the latent retain?

Fitting ability – can the prior accurately model the true latent distribution?

This decomposition lets researchers pinpoint failures (e.g., poor reconstruction vs. weak prior) rather than blaming a monolithic “next‑token” loss.
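One way to make the three terms concrete is the standard negative evidence lower bound of a latent-variable model. The source does not give the formula, so the bound below is an assumption about its general shape rather than the paper's exact objective. The symbols are introduced here for illustration: q_φ is the encoder, p_θ the decoder, and p_ψ the learned prior.

```latex
\begin{aligned}
-\log p(x) \;\le\;&
\underbrace{\mathbb{E}_{q_\phi(z\mid x)}\bigl[-\log p_\theta(x\mid z)\bigr]}_{\text{reconstruction}}
\;-\;\underbrace{\mathcal{H}\bigl[q_\phi(z\mid x)\bigr]}_{\text{compression}}
\;+\;\underbrace{\mathbb{E}_{q_\phi(z\mid x)}\bigl[-\log p_\psi(z)\bigr]}_{\text{prior fitting}}
\end{aligned}
```

The last two terms together equal the KL divergence between the encoder posterior and the prior. Under this reading, a high first term points at the decoder, while a high third term points at the prior.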

Empirical findings

In strict scaling experiments (~2 B parameters, ~2000 EFLOPs), Cola DLM shows a more stable scaling trend than autoregressive models and mainstream discrete diffusion language models.

Compared with the earlier ELF model (a 105 M‑parameter model that bypasses token space), Cola DLM separates semantic generation from textual rendering, a modular design that aligns better with multimodal generation pipelines.

Broader implications

Moving generation to a continuous semantic latent space enables tighter integration with image, video, and audio modalities, which are naturally continuous.

Cola DLM is not merely another competitor in the diffusion‑language‑model race; it is a bridge that connects text to a continuous multimodal world.

Key comparisons with ELF

Both ELF and Cola DLM challenge the assumption that language models must be built on discrete tokens. ELF performs diffusion directly in the original embedding space, while Cola DLM splits the pipeline into a semantic prior and a separate decoder, allowing parallel local semantic organization (block‑causal) and global causal consistency.

Resources

HuggingFace model hub: https://huggingface.co/ByteDance-Seed/Cola-DLM

GitHub repository: https://github.com/ByteDance-Seed/Cola-DLM

Paper (arXiv): https://arxiv.org/abs/2605.06548

Blog post: https://hongcanguo.github.io/posts/2026-cola-dlm-zh.html
