What Makes DeepSeek V4 Different? A Deep Technical Dive into Its Innovations

DeepSeek V4 introduces a suite of architectural changes, including an upgraded Mixture‑of‑Experts (MoE) backbone, manifold‑constrained hyper‑connections, CSA/HCA hybrid attention, and FP4 quantization, that cut inference cost by up to tenfold while delivering a million‑token context window, competitive benchmark results, two model variants, and a disruptive pricing strategy.

Architect's Guide

May 29, 2026

Background and Evolution

DeepSeek, a Chinese AI research company founded in 2023, progressed from the 671B‑parameter MoE model V3 to the R1 release in early 2025, which matched OpenAI’s o1 performance and sparked a wave of open‑source model development in China. V3.2 extended the context window to 128K tokens, setting the stage for V4.

V4 Release Overview

On 24 April 2026 DeepSeek launched V4‑Pro (1.6 T total parameters, 49 B active parameters) and V4‑Flash (284 B total, 13 B active), both with a 1 M token context window and MIT‑licensed weights. The technical report, titled “DeepSeek‑V4: Towards Highly Efficient Million‑Token Context Intelligence,” details six core innovations.
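The parameter counts above imply a very sparse activation pattern. The short calculation below uses only the figures quoted in this section; the dictionary layout and function name are illustrative.

```python
# Parameter counts in billions (total, active), as quoted above.
MODELS = {"V4-Pro": (1600, 49), "V4-Flash": (284, 13)}


def active_fraction(total_b, active_b):
    """Share of the model's parameters that are used for any single token."""
    return active_b / total_b


fractions = {name: active_fraction(*counts) for name, counts in MODELS.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

By this measure V4‑Pro activates roughly 3 % of its weights per token and V4‑Flash roughly 4.6 %, which is why per‑token compute tracks the active count rather than the total.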

Core Architectural Innovations

The model builds on a MoE backbone and introduces:

MoE upgrades: a larger expert count, fewer activated parameters, and a new routing scheme.

Manifold‑Constrained Hyper‑Connections (mHC): multi‑stream residual connections whose mixing matrices are projected onto the Birkhoff polytope (the set of doubly stochastic matrices) via ~20 Sinkhorn‑Knopp iterations, preventing the 3,000× signal amplification seen in unconstrained designs.

Hybrid Attention (CSA + HCA + SWA): three attention types used in alternating layers, namely Sliding Window Attention (local), Compressed Sparse Attention (moderate compression with top‑k selection), and Heavily Compressed Attention (128:1 compression for global context). For a 1 M‑token context, this reduces FLOPs to 10 % of V3.2’s and KV cache size to 7 %.

Muon Optimizer: replaces AdamW for most parameters, combining Nesterov momentum with Newton‑Schulz orthogonalization of the update matrix for faster convergence and stability.

FP4 + FP8 Quantization‑Aware Training: expert weights stored in FP4, other weights in FP8, halving memory use without noticeable quality loss.
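To make the mHC item more concrete, here is a minimal sketch of Sinkhorn‑Knopp normalization in plain Python. It is not DeepSeek's implementation: the function name, the 4×4 example matrix, and the assumption that the raw mixing weights are already strictly positive (for instance after exponentiation) are all illustrative. The loop alternately rescales rows and columns until the matrix is approximately doubly stochastic, that is, a point in the Birkhoff polytope.

```python
def sinkhorn_knopp(matrix, iterations=20):
    """Alternately normalize rows and columns of a positive square matrix.

    The result approaches a doubly stochastic matrix: every row and every
    column sums to 1.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row step: scale each row so it sums to 1.
        for i in range(n):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        # Column step: scale each column so it sums to 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


# Hypothetical mixing weights for 4 residual streams (values are made up).
raw = [
    [2.0, 0.5, 0.1, 0.4],
    [0.3, 1.5, 0.7, 0.2],
    [0.6, 0.2, 1.8, 0.9],
    [0.1, 0.8, 0.4, 2.2],
]
mixed = sinkhorn_knopp(raw, iterations=20)
row_sums = [sum(row) for row in mixed]
col_sums = [sum(mixed[i][j] for i in range(4)) for j in range(4)]
print(row_sums, col_sums)
```

A doubly stochastic matrix is a weighted average of permutation matrices, and the product of two such matrices is again doubly stochastic. That gives one intuition for why constraining the mixing matrices this way keeps the signal from growing as layers are stacked.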

┌─────────────────────────────────────────────────────┐
│              DeepSeek V4 Architecture               │
├─────────────────────┬───────────────────────────────┤
│  Attention Layer    │  FFN Layer (MoE)              │
│  ─────────────────  │  ┌───────────────────────┐    │
│  Sliding Window     │  │ DeepSeekMoE           │    │
│  (SWA)              │  │  ┌─────────────────┐  │    │
│  ─────────────────  │  │  │ Shared Experts  │  │    │
│  Compressed Sparse  │  │  └─────────────────┘  │    │
│  Attention (CSA)    │  │  ┌─────────────────┐  │    │
│  ─────────────────  │  │  │ Routing Experts │  │    │
│  Heavily Compressed │  │  └─────────────────┘  │    │
│  Attention (HCA)    │  └───────────────────────┘    │
└─────────────────────┴───────────────────────────────┘
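The three attention types in the diagram can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the report's compressor is presumably learned, whereas this sketch simply mean‑pools blocks of keys, scores the block summaries with a dot product, and picks the top‑k. The function names and the 1,000,000‑token reading of "1 M" are likewise hypothetical.

```python
import math


def compress_blocks(keys, ratio):
    """Mean-pool each run of `ratio` consecutive key vectors into one summary."""
    summaries = []
    for start in range(0, len(keys), ratio):
        block = keys[start:start + ratio]
        dim = len(block[0])
        summaries.append([sum(vec[d] for vec in block) / len(block) for d in range(dim)])
    return summaries


def top_k_blocks(query, summaries, k):
    """Return (in ascending order) the indices of the k best-scoring summaries."""
    scores = [sum(q * s for q, s in zip(query, summary)) for summary in summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sliding_window(position, window):
    """Indices of the most recent `window` tokens visible at `position`."""
    return list(range(max(0, position - window + 1), position + 1))


# Entries a 128:1 compressed layer would keep for a 1,000,000-token context.
hca_entries = math.ceil(1_000_000 / 128)

# Toy example: 8 two-dimensional keys, compressed 4:1 (ratio chosen for illustration).
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
summaries = compress_blocks(keys, ratio=4)
selected = top_k_blocks([0.0, 1.0], summaries, k=1)
local = sliding_window(position=7, window=3)
print(hca_entries, summaries, selected, local)
```

The point of the sketch is the division of labor: the sliding window sees recent tokens at full resolution, the sparse layer looks up only the few compressed blocks most relevant to the query, and the heavily compressed layer keeps a short summary of the entire context.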

Training Stability Techniques

Two novel mechanisms address loss spikes in trillion‑parameter MoE training:

Anticipatory Routing: decouples backbone updates from routing updates by using parameters from a few steps earlier, breaking the feedback loop that amplifies instability.

SwiGLU Clamping: clamps linear components to [-10, 10] and caps gating components at 10, preventing extreme activations.
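The clamping rule can be sketched on scalar pre‑activations as follows. This is an illustration under stated assumptions rather than the report's code: it assumes the gate branch is capped from above only while the linear branch is clamped on both sides, and the function names are hypothetical.

```python
import math


def silu(x):
    """Numerically stable SiLU (swish): x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe for large negative x, underflows to 0.0
    return x * e / (1.0 + e)


def clamped_swiglu(gate, linear, limit=10.0):
    """SwiGLU on two pre-activations, with clamping applied before the product.

    Assumption: gate is capped at +limit only; linear is clamped to [-limit, limit].
    """
    gate = min(gate, limit)
    linear = max(-limit, min(linear, limit))
    return silu(gate) * linear


# Extreme inputs no longer produce extreme outputs.
bounded = clamped_swiglu(50.0, 200.0)    # equals silu(10.0) * 10.0
unclamped = silu(50.0) * 200.0           # what the same inputs give without clamping
print(bounded, unclamped)
```

With a limit of 10 on both branches, the magnitude of the product stays below 100 regardless of how large the pre‑activations become.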

Two‑Stage Post‑Training Paradigm

V4 replaces the single‑stage RL fine‑tuning of V3.2 with a two‑stage process. First, domain‑specific expert models (coding, math, agentic, instruction, world knowledge, etc.) are trained with high‑quality SFT data and GRPO reinforcement learning. Second, Online Policy Distillation (OPD) merges these experts into a unified student model, preserving each domain’s “ceiling” performance while avoiding the averaging effect of traditional multi‑task training.
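For readers unfamiliar with GRPO, its central idea is to score each sampled answer relative to the other answers drawn for the same prompt, instead of training a separate value model. A minimal sketch of that group‑relative advantage is shown below; the function name and reward values are hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1.0 = correct).
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
uniform = grpo_advantages([1.0, 1.0, 1.0])
print(advantages, uniform)
```

Answers that beat the group average receive a positive advantage and are reinforced, while below‑average answers are pushed down. When every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.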

Agentic Capabilities

V4‑Pro supports a three‑level reasoning effort (Non‑Think, Think High, Think Max) and introduces Quick Instruction Tokens such as <|action|>, <|title|>, and <|query|> to enable efficient tool‑calling without redundant prefilling. The model integrates seamlessly with popular agent frameworks (Claude Code, OpenClaw/OpenCode, CodeBuddy) via OpenAI‑compatible and Anthropic‑compatible APIs.

Infrastructure and Deployment

V4‑Pro inference is optimized for clusters of NVIDIA H100/H200 GPUs, while V4‑Flash runs on a few H100 SXM cards. DeepSeek also announced inference support for Huawei Ascend 950 chips, offering a path for China‑based deployments. The SGLang + Miles stack provides specialized kernels (ShadowRadix cache, DeepGEMM Mega MoE, Flash Compressor) that reduce memory traffic and improve throughput.

Benchmark Results

Across a wide range of evaluations, V4‑Pro consistently outperforms V3.2 and rivals top‑tier closed‑source models. Highlights include:

MMLU‑Pro 5‑shot: 73.5 % (vs. 65.5 % for V3.2).

HumanEval 0‑shot: 76.8 % (vs. 62.8 % for V3.2).

LiveCodeBench: 93.5 % (top of the leaderboard, surpassing Claude Opus 4.6’s 88.8 %).

LongBench‑V2 1‑shot (1 M token context): 51.5 % for V4‑Pro vs. 40.2 % for V3.2.

However, the report candidly notes gaps: lower scores on SimpleQA‑Verified (57.9 % vs. Gemini‑3.1‑Pro’s 75.6 %) and only modest advantages on the hardest reasoning benchmarks (HLE, IMO).

Pricing Strategy

DeepSeek adopts a disruptive cost model. For V4‑Pro, output tokens cost $3.48 /M, with cache‑hit input pricing at $0.145 /M and cache‑miss at $1.74 /M. V4‑Flash’s output price is $0.28 /M, with cache‑hit input at $0.028 /M. Compared with Claude Opus 4.6 ($25 /M output) and GPT‑5.4 ($15 /M output), DeepSeek’s output pricing is roughly 7–89 times cheaper than Claude Opus 4.6 and 4–54 times cheaper than GPT‑5.4, depending on the variant. The gap widens for workloads with repeated system prompts, where cache‑hit input costs become negligible.
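The ratios and the effect of caching can be checked with a few lines of arithmetic. The prices are the ones quoted above; the function names, the example request size, and the 90 % cache‑hit share are hypothetical.

```python
# Output prices in USD per million tokens, as quoted in the text.
OUTPUT_PRICE = {
    "V4-Pro": 3.48,
    "V4-Flash": 0.28,
    "Claude Opus 4.6": 25.0,
    "GPT-5.4": 15.0,
}


def price_ratio(competitor, deepseek):
    """How many times more expensive the competitor's output tokens are."""
    return OUTPUT_PRICE[competitor] / OUTPUT_PRICE[deepseek]


def v4_pro_request_cost(input_tokens, output_tokens, cached_fraction,
                        hit_price=0.145, miss_price=1.74, output_price=3.48):
    """USD cost of one V4-Pro request, given the share of input served from cache."""
    hit = input_tokens * cached_fraction
    miss = input_tokens - hit
    total = hit * hit_price + miss * miss_price + output_tokens * output_price
    return total / 1_000_000


for rival in ("Claude Opus 4.6", "GPT-5.4"):
    for model in ("V4-Pro", "V4-Flash"):
        print(f"{rival} vs {model}: {price_ratio(rival, model):.1f}x")

# Hypothetical request: 100K input tokens, 2K output tokens.
cold = v4_pro_request_cost(100_000, 2_000, cached_fraction=0.0)
warm = v4_pro_request_cost(100_000, 2_000, cached_fraction=0.9)
print(f"no cache: ${cold:.4f}, 90% cached: ${warm:.4f}")
```

For this example request, moving from no cache hits to a 90 % hit rate cuts the cost from about $0.18 to about $0.04, which is the effect the paragraph describes for repeated system prompts.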

Limitations and Future Work

The technical report openly lists several limitations:

Architectural complexity remains high; future versions aim to simplify the design.

Anticipatory Routing and SwiGLU Clamping lack solid theoretical grounding.

World‑knowledge benchmarks lag behind Gemini‑3.1‑Pro.

Retrieval accuracy degrades beyond 128 K tokens, affecting very long‑context tasks.

V4 is a preview release; further post‑training improvements are planned.

Compliance notes warn that using DeepSeek’s hosted API routes data through servers in China, which may raise data‑sovereignty concerns for regulated industries. The open‑source weights (MIT license) enable self‑hosting to mitigate such risks.

Outlook

DeepSeek envisions V5 with a streamlined architecture, deeper theoretical understanding of stability tricks, narrowed knowledge gaps, improved long‑context retrieval, and broader support for domestic chips during training. If successful, the open‑source‑to‑closed‑source performance gap will shrink further, making efficiency a primary competitive axis in AI.


