
LLaDA and LLaDA‑V: Large Language Diffusion Models and Their Multimodal Extensions

This article presents the LLaDA series of diffusion‑based large language models, explains how their generative‑modeling principle yields language intelligence comparable to autoregressive models, and details the multimodal LLaDA‑V architecture, training methods, experimental results, and broader implications for AI research.

AntTech

Recent work by researchers at the Gaoling School of Artificial Intelligence, Renmin University of China, advanced a key insight: the language intelligence exhibited by large language models (LLMs)—including in‑context learning, instruction following, reasoning, and multi‑turn dialogue—originates from the underlying generative‑modeling principle (maximum likelihood estimation, equivalently KL‑divergence minimization) rather than from the autoregressive mechanism itself. Building on this insight, they released the diffusion‑based LLM LLaDA and its multimodal counterpart LLaDA‑V.

The article first reviews generative modeling as a unified probabilistic framework that models a high‑dimensional distribution p_θ and minimizes its divergence from the true data distribution. It outlines three essential components of generative models (network architecture, scale, and probabilistic method) and contrasts autoregressive models, which factorize the distribution sequentially, with diffusion models, which employ a forward noising and reverse denoising process.
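To make the contrast concrete, the autoregressive factorization and the shared generative‑modeling objective can be sketched in standard notation (this is textbook notation, not a formula reproduced from the paper):

```latex
% Autoregressive models factorize the sequence distribution left to right:
p_\theta(x) = \prod_{i=1}^{L} p_\theta\left(x_i \mid x_{<i}\right)

% Both families optimize the same generative-modeling principle:
% maximum likelihood is equivalent to minimizing the KL divergence
% between the data distribution and the model.
\max_\theta \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_\theta(x)\right]
\;\Longleftrightarrow\;
\min_\theta \; \mathrm{KL}\!\left(p_{\mathrm{data}} \,\middle\|\, p_\theta\right)
```

Diffusion LLMs such as LLaDA keep the objective on the right but replace the left‑to‑right factorization with a masking‑based forward/reverse process.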

Key advantages of diffusion LLMs are highlighted: (1) scalability, supported by Fisher consistency, a simple cross‑entropy training objective, and the Transformer architecture; (2) instruction‑following and in‑context learning capabilities, which arise from any well‑defined conditional generative model, not only autoregressive ones; (3) compression‑as‑intelligence, since maximum likelihood estimation corresponds to lossless data compression; and (4) bidirectional modeling, enabling superior performance on reverse‑direction tasks such as poetry line reversal, where LLaDA outperforms GPT‑4o.

The core methodology is described in detail. During pre‑training, a forward process progressively masks tokens from time t = 0 to 1, while the reverse process iteratively denoises a fully masked sequence. The loss is derived as an upper bound on the negative log‑likelihood, implemented via cross‑entropy prediction of the clean tokens at masked positions. Supervised fine‑tuning uses the same loss but masks only the response, leaving the prompt tokens intact.
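The pre‑training step above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `MASK_ID`, the toy vocabulary, and the random logits standing in for the Transformer's output are all assumptions for the demo.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the [MASK] token

def forward_mask(tokens, t, rng):
    """Forward process at time t: mask each token independently with prob t."""
    mask = rng.random(tokens.shape) < t
    return np.where(mask, MASK_ID, tokens), mask

def masked_cross_entropy(logits, tokens, mask, t):
    """Cross-entropy on masked positions only, weighted by 1/t.

    Averaged over t, this kind of weighted masked loss upper-bounds the
    negative log-likelihood -- the quantity LLaDA-style pre-training optimizes.
    """
    # log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(tokens)), tokens]
    return (nll * mask).sum() / (t * len(tokens))

# toy demo: 4 tokens over a 5-word vocabulary, masked at t = 0.5
rng = np.random.default_rng(0)
tokens = np.array([3, 1, 2, 4])
noised, mask = forward_mask(tokens, 0.5, rng)
logits = rng.normal(size=(4, 5))  # stand-in for the model's output on `noised`
loss = masked_cross_entropy(logits, tokens, mask, 0.5)
```

Supervised fine‑tuning reuses the same loss; the only change is that `forward_mask` is applied to the response tokens while the prompt stays unmasked.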

LLaDA‑V extends LLaDA to multimodal scenarios by integrating a visual encoder (SigLIP 2), an MLP projector, and the LLaDA language tower. Visual features are projected into the language embedding space, and the model is trained with a masked‑response objective that computes cross‑entropy only on masked tokens. Inference proceeds via a reverse diffusion process that starts from a fully masked reply and iteratively restores tokens, using a low‑confidence re‑masking strategy to improve generation quality.
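The reverse‑diffusion inference loop with low‑confidence re‑masking can be sketched as follows. This is a simplified illustration under stated assumptions: the linear mask‑annealing schedule, the `mask_id` value, and the toy predictor (which ignores its input, unlike a real model that conditions on the partially unmasked sequence) are all invented for the demo.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def diffusion_generate(predict_fn, length, steps, mask_id=0):
    """Reverse process with low-confidence re-masking.

    Start from a fully masked reply; at each step predict every position,
    keep the most confident predictions, and re-mask the rest so later
    steps can revise them with more context.
    """
    seq = np.full(length, mask_id, dtype=int)
    for step in range(steps, 0, -1):
        probs = softmax(predict_fn(seq))   # (length, vocab)
        pred = probs.argmax(axis=-1)       # most likely token per position
        conf = probs.max(axis=-1)          # its confidence
        seq = pred.copy()
        n_remask = round(length * (step - 1) / steps)  # anneal masks to zero
        if n_remask:
            seq[np.argsort(conf)[:n_remask]] = mask_id
    return seq

# toy predictor: position i always favors token i + 1
toy_logits = np.zeros((4, 5))
toy_logits[np.arange(4), np.arange(1, 5)] = 5.0
reply = diffusion_generate(lambda seq: toy_logits, length=4, steps=4)
```

By the final step no positions are re‑masked, so the returned sequence is fully denoised.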

Experimental results show that LLaDA matches autoregressive models such as LLaMA‑3‑8B on pure language tasks, while LLaDA‑V surpasses comparable autoregressive baselines on multimodal benchmarks (e.g., MMMU, MMStar). LLaDA‑V achieves state‑of‑the‑art performance among both pure‑diffusion and hybrid autoregressive‑diffusion multimodal models, demonstrating strong data scalability and competitive vision‑language understanding.

In conclusion, diffusion‑based LLMs prove that the core capabilities of large language models do not depend exclusively on autoregressive generation. The work opens a new research direction for multimodal AI, suggesting that diffusion architectures will play an increasingly important role in future AI systems.

Tags: multimodal AI, large language models, diffusion models, generative modeling, instruction following
Written by

AntTech

Technology is the core driver of how Ant creates the future.
