Performance Analysis of NVIDIA H20 and L20 AI Inference Chips
This article evaluates NVIDIA's China‑specific H20 and L20 inference chips, comparing their compute and memory‑bandwidth characteristics against the A100, H100, and H200, and shows how the H20 in particular achieves superior large‑model inference throughput despite its reduced compute specifications.
NVIDIA has released the China‑specific AI accelerator chips H20 and L20, whose advertised compute specifications (FP16, INT8) are roughly half the A100's and one‑seventh the H100's, raising concerns about their performance.
Theoretical calculations, however, indicate that both H20 and L20 deliver strong inference performance, with H20 surpassing A100 and H100 in inference throughput and only slightly trailing the H200.
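As a back‑of‑the‑envelope check on that claim, the sketch below applies a simple roofline argument: decode‑phase inference must stream the model weights from memory on every step, so each chip's step‑rate ceiling is memory bandwidth divided by weight bytes. The TFLOPS and bandwidth numbers are assumptions taken from public datasheets, not figures from this article:

```python
# Roofline-style sketch of decode throughput ceilings (Python).
# Spec values are assumptions from vendor datasheets, not measurements
# from this article.
SPECS = {
    #        dense FP16 TFLOPS    HBM bandwidth (TB/s)
    "A100": {"tflops": 312.0,  "bw_tbs": 2.0},
    "H100": {"tflops": 989.0,  "bw_tbs": 3.35},
    "H200": {"tflops": 989.0,  "bw_tbs": 4.8},
    "H20":  {"tflops": 148.0,  "bw_tbs": 4.0},
}

WEIGHT_BYTES_13B = 13e9 * 2  # Llama2-13B weights in FP16 (~26 GB)

for name, s in SPECS.items():
    # Decode is dominated by streaming the weights once per step,
    # so the step-rate ceiling is bandwidth / weight bytes.
    steps_per_s = s["bw_tbs"] * 1e12 / WEIGHT_BYTES_13B
    print(f"{name}: ~{steps_per_s:.0f} decode steps/s ceiling "
          f"({s['tflops']:.0f} TFLOPS FP16, {s['bw_tbs']} TB/s)")
```

Under this memory‑bound model, H20's ceiling (~154 steps/s) lands above A100 (~77) and H100 (~129) and just below H200 (~185), matching the ordering the article reports.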
Benchmark tests running the Llama2‑13B model (FP16, batch size 16) on a single H20, A100, H100, and H200 measured token throughput across three input/output‑length configurations. Averaged across the three, H20 achieved 1.8× the inference throughput of A100 and 1.1× that of H100.
The Prefill stage is compute‑intensive, so H20's weaker compute capability results in higher latency than the other chips: users wait longer before the first token appears.
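To make the prefill cost concrete, a minimal sketch below estimates time‑to‑first‑token from the standard ~2 FLOPs‑per‑parameter‑per‑token forward‑pass count. The prompt length, peak‑TFLOPS values, and utilization factor are all illustrative assumptions:

```python
# Minimal TTFT sketch: prefill is compute-bound, so time-to-first-token
# scales with prompt FLOPs / achievable FLOPS. All values are assumptions.
N_PARAMS = 13e9          # Llama2-13B
PROMPT_TOKENS = 1024     # hypothetical prompt length
MFU = 0.5                # assumed fraction of peak FLOPS actually achieved

prefill_flops = 2 * N_PARAMS * PROMPT_TOKENS  # ~2 FLOPs per weight per token

for name, peak_tflops in {"H20": 148.0, "A100": 312.0, "H100": 989.0}.items():
    ttft = prefill_flops / (peak_tflops * 1e12 * MFU)
    print(f"{name}: prefill of {PROMPT_TOKENS} tokens ~ {ttft * 1000:.0f} ms")
```

With these assumed numbers, H20 takes roughly 360 ms versus about 170 ms on A100 and 55 ms on H100, which is the longer first‑token wait described above.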
The Decode stage, in contrast, is memory‑bandwidth‑bound, and H20's higher memory bandwidth lets it generate tokens faster than A100 and H100, at about 57 tokens/s, well above typical human reading speed.
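That 57 tokens/s figure can be sanity‑checked against H20's bandwidth ceiling from the earlier sketch. In the batch‑16 setup each decode step streams the weights once and produces one token per sequence, so the per‑stream rate is bounded by the step‑rate ceiling; the snippet below compares the two, with KV‑cache traffic and kernel overheads lumped into the residual gap (the bandwidth value is an assumed datasheet figure):

```python
# Sanity check of the ~57 tokens/s decode rate against H20's
# bandwidth ceiling. The shortfall from the ceiling stands in for
# KV-cache reads, kernel launch overhead, and other real-world costs.
BW_BYTES_PER_S = 4.0e12   # H20 HBM3 bandwidth (assumed datasheet value)
WEIGHT_BYTES = 13e9 * 2   # Llama2-13B FP16 weights

ceiling = BW_BYTES_PER_S / WEIGHT_BYTES   # ideal decode steps/s (~154)
measured = 57.0                           # tokens/s reported by the article
print(f"ceiling ~{ceiling:.0f} steps/s; measured 57 tokens/s "
      f"=> ~{measured / ceiling:.0%} of the memory-bound ceiling")
```

Landing at roughly a third of the ideal ceiling is a plausible efficiency once KV‑cache traffic is accounted for, so the reported rate is consistent with the memory‑bound picture.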
Overall, despite a longer initial latency, H20 delivers high throughput and favorable cost‑performance, making it a competitive choice for many inference workloads when priced similarly to H100.