Tagged articles
3 articles
Architects' Tech Alliance
Apr 27, 2026 · Artificial Intelligence

Why Huawei’s Ascend 950 PR and DT Have Different Names – The Technical Rationale

Huawei’s Ascend 950 series splits a single die into two variants: PR (Prefill & Recommendation), optimized for low-cost, compute-intensive inference, and DT (Decode & Training), tuned for memory-bandwidth-heavy generation and training. The split illustrates a scenario-driven architecture that separates prefill from decode (P/D separation) to improve efficiency.

AI Chip · Ascend 950 · Decode
5 min read
ShiZhen AI
Apr 2, 2026 · Artificial Intelligence

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

The article explains how the KV Cache stores previously computed key/value vectors to avoid redundant Transformer computation, delivering roughly a 5× speedup. It also details why generating output tokens is far more expensive than processing input tokens, citing serial generation and memory trade-offs.

KV Cache · LLM inference · Memory Optimization
9 min read
Architect
Jul 2, 2024 · Artificial Intelligence

Mooncake: A Separated Architecture for Large‑Language‑Model Inference

The article presents Mooncake, a split-architecture inference platform for the Kimi LLM assistant. It details the platform's three elastic resource pools, the rationale for using Time Between Tokens (TBT) rather than Time Per Output Token (TPOT), and the design choices for the Prefill, KVCache, and Decode stages that improve latency and throughput.

AI Systems · Decode · KVCache
9 min read