Zhipu AI Unveils GLM-5.1-HighSpeed, Achieving 400 Tokens/s and 6× Faster Generation

On May 22, 2026, Zhipu AI released the GLM‑5.1‑HighSpeed variant, which generates up to 400 tokens per second, more than six times the speed of the standard GLM‑5.1 and twice that of Google’s Gemini‑3.5‑Flash. Zhipu AI attributes the gain to optimizations in inference scheduling, attention, and sequence‑parallel computation, and says the model's full capabilities are preserved.

ZhiKe AI

On May 22, 2026, Zhipu AI announced the GLM‑5.1‑HighSpeed variant, a fast‑track version of its GLM‑5.1 large language model.

According to Zhipu AI's official measurements, the model outputs up to 400 tokens per second, more than six times the throughput of the standard GLM‑5.1 and twice that of Google’s Gemini‑3.5‑Flash. Zhipu AI also reports that streaming output remains stable and coherent, with lower latency and higher throughput than typical large models.
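To put the quoted ratios in concrete terms, the short sketch below works out the throughput they imply for the two comparison models and how long each would take to stream a response. It uses only the figures quoted in the announcement; the 2,000-token response length and all function and variable names are hypothetical, the "more than six times" ratio is treated as exactly 6 (so the standard GLM‑5.1 figure is an upper bound), and time to first token is ignored.

```python
# Figures quoted in the announcement (peak values, not independently verified).
HIGHSPEED_TPS = 400.0       # tokens per second for GLM-5.1-HighSpeed
SPEEDUP_VS_STANDARD = 6.0   # "more than six times": treated here as exactly 6
SPEEDUP_VS_GEMINI = 2.0     # "twice the speed" of Gemini-3.5-Flash

RESPONSE_TOKENS = 2000      # hypothetical response length, chosen for illustration


def implied_baseline_tps(fast_tps: float, speedup: float) -> float:
    """Throughput of the slower model implied by a quoted speedup ratio."""
    return fast_tps / speedup


def generation_seconds(num_tokens: int, tps: float) -> float:
    """Time to stream num_tokens at a steady rate (ignores time to first token)."""
    return num_tokens / tps


# Because the real speedup is "more than" 6x, this is an upper bound.
standard_tps = implied_baseline_tps(HIGHSPEED_TPS, SPEEDUP_VS_STANDARD)
gemini_tps = implied_baseline_tps(HIGHSPEED_TPS, SPEEDUP_VS_GEMINI)

rates = [
    ("GLM-5.1-HighSpeed", HIGHSPEED_TPS),
    ("Gemini-3.5-Flash (implied)", gemini_tps),
    ("GLM-5.1 standard (implied upper bound)", standard_tps),
]

for name, tps in rates:
    seconds = generation_seconds(RESPONSE_TOKENS, tps)
    print(f"{name}: {tps:.1f} tokens/s, {seconds:.1f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions, the same 2,000-token response would stream in about 5 seconds on GLM‑5.1‑HighSpeed, about 10 seconds at the implied Gemini‑3.5‑Flash rate, and at least 30 seconds on the standard GLM‑5.1.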

According to Zhipu AI, the speed gain does not come from simply shrinking the model's parameter count. Instead, the company redesigned the underlying inference scheduler, refined the attention mechanism, and introduced multi‑dimensional sequence‑parallel computation. Zhipu AI says these changes accelerate processing while fully preserving the original model’s abilities in logical reasoning, copywriting, knowledge question answering, and long‑text handling.

In the current global race for ultra‑fast large models, the launch of GLM‑5.1‑HighSpeed signals that Chinese‑developed models have entered the first tier of real‑time inference, overtaking many overseas fast‑generation models on output speed.

As high‑speed LLMs become more widespread, low‑latency, high‑efficiency generation is expected to become a standard feature of future model products, potentially driving down token costs for end users.

Reference: https://docs.bigmodel.cn/cn/guide/models/text/glm-5.1-highspeed?webview_progress_bar=1&show_loading=0&push_animated=1&theme=light
