
Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training

Tencent’s Star Network delivers a 1.6 Tbps Ethernet‑RDMA fabric with a fat‑tree topology supporting up to 4K GPUs, multi‑track traffic aggregation, adaptive heterogeneous‑link communication, and a custom TCCL collective library. It cuts AllReduce communication overhead from 35 % to 3.7 % and speeds AI training iterations by 32 %, while automating deployment and providing sub‑second self‑healing.

Tencent Cloud Developer

Recent breakthroughs in AIGC (ChatGPT, code generation, novel writing, etc.) rely on massive large‑model training that requires long‑running, large‑scale GPU clusters. The performance, reliability and cost of the underlying network become critical bottlenecks.

This article introduces the three mainstream GPU‑cluster network routes in the industry and presents Tencent’s own solution – the Star Network – which is designed to meet the extreme demands of AI training workloads.

Key technical features of the Star Network:

1.6 Tbps ultra‑high‑bandwidth Ethernet RDMA fabric, providing more than 10× communication speedup for AllReduce and All‑to‑All patterns.

Fat‑Tree topology with support for up to 4 K GPUs per cluster (scalable to 64 K GPUs).

Multi‑track traffic aggregation that groups NICs belonging to the same rack into a single ToR switch, achieving >80 % traffic aggregation efficiency.

Heterogeneous network adaptive communication that jointly exploits inter‑node (NIC + switch) and intra‑node (NVLink/NVSwitch) links, delivering ~30 % performance gain for All‑to‑All at typical message sizes.

Custom collective communication library (TCCL) built on NCCL, tuned for the Star hardware, delivering ~40 % acceleration for AllReduce, AllGather and ReduceScatter.
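The collectives above dominate training traffic, and a standard back‑of‑envelope ring‑AllReduce cost model (a generic formula, not Tencent’s internal model) shows why fabric bandwidth translates almost directly into communication speedup:

```python
def ring_allreduce_seconds(msg_bytes: float, n_gpus: int, link_gbps: float) -> float:
    """Bandwidth term of ring AllReduce: each GPU sends and receives
    2*(n-1)/n of the message over its link (latency term omitted)."""
    link_bytes_per_s = link_gbps * 1e9 / 8
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes / link_bytes_per_s

# 1 GiB gradient bucket across 256 GPUs, on a 100 Gbps vs a 1.6 Tbps fabric
msg = 1 << 30
slow = ring_allreduce_seconds(msg, 256, 100)
fast = ring_allreduce_seconds(msg, 256, 1600)
print(f"speedup ≈ {slow / fast:.0f}x")
```

In this bandwidth‑bound regime the speedup scales with the link rate, which is consistent with the ">10× communication speedup" claim for large messages.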

Performance measurements on GPT‑3 and T5‑MoE models show that the 1.6 Tbps fabric reduces communication overhead from 35 % to 3.7 % (AllReduce) and cuts iteration time by 32 %, effectively increasing cluster compute power by 48 %.
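These three figures are mutually consistent under the simple assumption that per‑iteration compute time is unchanged and only the communication share shrinks, as a quick arithmetic check shows:

```python
# Worked check of the reported numbers: comm overhead drops from 35% of
# the old iteration to 3.7% of the new one, compute time held constant.
compute = 1.0 - 0.35            # compute fraction of the old iteration
t_old = 1.0                     # normalized old iteration time
t_new = compute / (1 - 0.037)   # new iteration: same compute, 3.7% comm share

iter_reduction = 1 - t_new / t_old   # ~0.325 -> "32% shorter iterations"
effective_gain = t_old / t_new - 1   # ~0.482 -> "~48% more effective compute"
print(f"{iter_reduction:.1%} shorter iterations, "
      f"{effective_gain:.1%} effective compute gain")
```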

Beyond raw bandwidth, the solution includes a fully automated deployment pipeline that integrates NUMA, PCIe, NVSwitch, NIC and switch configuration, provides one‑click fault localization, and supports automatic health monitoring via Service Telemetry.

Operational features:

End‑to‑end network deployment integration reduces cluster rollout time from 19 days to 4.5 days with 100 % configuration accuracy.

One‑click fault diagnosis distinguishes between network‑side and application‑side issues, automatically isolates problematic NICs or switches, and triggers deterministic path switching using a hash‑offset algorithm to achieve sub‑second self‑healing.
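The hash‑offset idea can be sketched as deterministic re‑selection of an uplink: rather than rehashing the whole flow table, a deterministic offset is added to the original ECMP‑style hash so affected flows skip the failed port while all other flows keep their paths. All names and details here are illustrative, not Tencent’s implementation:

```python
import zlib

def pick_uplink(flow_key: bytes, uplinks: list[str], failed: set[str]) -> str:
    """ECMP-style hash selection with deterministic offset probing
    past failed ports, so rerouting needs no global rehash."""
    base = zlib.crc32(flow_key) % len(uplinks)
    for offset in range(len(uplinks)):      # hash-offset probing
        candidate = uplinks[(base + offset) % len(uplinks)]
        if candidate not in failed:
            return candidate
    raise RuntimeError("all uplinks down")

links = ["spine0", "spine1", "spine2", "spine3"]
flow = b"10.0.1.5:4791->10.0.2.9:4791"     # hypothetical RoCE flow key
primary = pick_uplink(flow, links, failed=set())
rerouted = pick_uplink(flow, links, failed={primary})
```

Because the offset walk is a pure function of the flow key and the failed set, every switch that learns of the failure computes the same alternate path, which is what makes sub‑second deterministic switching possible.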

Comprehensive validation steps (hardware checks, RDMA tests, collective library benchmarks, model‑level reliability tests) ensure that only fully verified clusters are delivered.
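One way to structure such staged acceptance is a fail‑fast runner; the stage names follow the article, but the runner itself is a hypothetical sketch with stubbed checks:

```python
from typing import Callable

def run_acceptance(stages: list[tuple[str, Callable[[], bool]]]) -> bool:
    """Run ordered validation stages, stopping at the first failure so a
    cluster is only delivered once every check has passed."""
    for name, check in stages:
        ok = check()
        print(f"[{'PASS' if ok else 'FAIL'}] {name}")
        if not ok:
            return False
    return True

stages = [
    ("hardware check (NUMA/PCIe/NVSwitch/NIC)", lambda: True),  # stubbed
    ("RDMA point-to-point test",                lambda: True),
    ("collective library benchmark",            lambda: True),
    ("model-level reliability test",            lambda: True),
]
delivered = run_acceptance(stages)
```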

Looking forward, the Star Network will be offered as a public‑cloud service on Tencent Cloud, paired with the A800 HCC 1.6 T instance, and will continue to evolve in bandwidth, heterogeneous communication, custom libraries and intelligent monitoring to support ever‑larger AI models.

Tags: distributed computing, RDMA, AI training, GPU clusters, high-performance networking
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
