
Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training

Tencent’s Star Network delivers a 1.6 Tbps Ethernet‑RDMA fabric with a fat‑tree topology supporting up to 4K GPUs, multi‑track traffic aggregation, adaptive heterogeneous‑link communication, and a custom TCCL collective library. It cuts AllReduce communication overhead from 35 % to 3.7 % and speeds AI training iterations by 32 %, while automating deployment and providing sub‑second self‑healing.

Tencent Cloud Developer

Recent breakthroughs in AIGC (ChatGPT, code generation, novel writing, etc.) rely on massive large‑model training that requires long‑running, large‑scale GPU clusters. The performance, reliability and cost of the underlying network become critical bottlenecks.

This article introduces the three mainstream GPU‑cluster network routes in the industry and presents Tencent’s own solution – the Star Network – which is designed to meet the extreme demands of AI training workloads.

Key technical features of the Star Network:

1.6 Tbps ultra‑high‑bandwidth Ethernet RDMA fabric, providing more than 10× communication speedup for AllReduce and All‑to‑All patterns.

Fat‑Tree topology with support for up to 4 K GPUs per cluster (scalable to 64 K GPUs).

Multi‑track traffic aggregation that groups NICs belonging to the same rack into a single ToR switch, achieving >80 % traffic aggregation efficiency.

Heterogeneous network adaptive communication that jointly exploits inter‑node (NIC + switch) and intra‑node (NVLink/NVSwitch) links, delivering ~30 % performance gain for All‑to‑All at typical message sizes.

Custom collective communication library (TCCL) built on NCCL, tuned for the Star hardware, delivering ~40 % acceleration for AllReduce, AllGather and ReduceScatter.
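The collectives above dominate training traffic, and a standard back‑of‑envelope ring‑AllReduce cost model (a generic formula, not Tencent’s internal model) shows why fabric bandwidth translates almost directly into communication speedup:

```python
def ring_allreduce_seconds(msg_bytes: float, n_gpus: int, link_gbps: float) -> float:
    """Bandwidth term of ring AllReduce: each GPU sends and receives
    2*(n-1)/n of the message over its link (latency term omitted)."""
    link_bytes_per_s = link_gbps * 1e9 / 8
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes / link_bytes_per_s

# 1 GiB gradient bucket across 256 GPUs, on a 100 Gbps vs a 1.6 Tbps fabric
msg = 1 << 30
slow = ring_allreduce_seconds(msg, 256, 100)
fast = ring_allreduce_seconds(msg, 256, 1600)
print(f"speedup ≈ {slow / fast:.0f}x")
```

In this bandwidth‑bound regime the speedup scales with the link rate, which is consistent with the ">10× communication speedup" claim for large messages.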

Performance measurements on GPT‑3 and T5‑MoE models show that the 1.6 Tbps fabric reduces communication overhead from 35 % to 3.7 % (AllReduce) and cuts iteration time by 32 %, effectively increasing cluster compute power by 48 %.
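These three figures are mutually consistent under the simple assumption that per‑iteration compute time is unchanged and only the communication share shrinks, as a quick arithmetic check shows:

```python
# Worked check of the reported numbers: comm overhead drops from 35% of
# the old iteration to 3.7% of the new one, compute time held constant.
compute = 1.0 - 0.35            # compute fraction of the old iteration
t_old = 1.0                     # normalized old iteration time
t_new = compute / (1 - 0.037)   # new iteration: same compute, 3.7% comm share

iter_reduction = 1 - t_new / t_old   # ~0.325 -> "32% shorter iterations"
effective_gain = t_old / t_new - 1   # ~0.482 -> "~48% more effective compute"
print(f"{iter_reduction:.1%} shorter iterations, "
      f"{effective_gain:.1%} effective compute gain")
```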

Beyond raw bandwidth, the solution includes a fully automated deployment pipeline that integrates NUMA, PCIe, NVSwitch, NIC and switch configuration, provides one‑click fault localization, and supports automatic health monitoring via Service Telemetry.

Operational features:

End‑to‑end network deployment integration reduces cluster rollout time from 19 days to 4.5 days with 100 % configuration accuracy.

One‑click fault diagnosis distinguishes between network‑side and application‑side issues, automatically isolates problematic NICs or switches, and triggers deterministic path switching using a hash‑offset algorithm to achieve sub‑second self‑healing.
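The hash‑offset idea can be sketched as deterministic re‑selection of an uplink: rather than rehashing the whole flow table, a deterministic offset is added to the original ECMP‑style hash so affected flows skip the failed port while all other flows keep their paths. All names and details here are illustrative, not Tencent’s implementation:

```python
import zlib

def pick_uplink(flow_key: bytes, uplinks: list[str], failed: set[str]) -> str:
    """ECMP-style hash selection with deterministic offset probing
    past failed ports, so rerouting needs no global rehash."""
    base = zlib.crc32(flow_key) % len(uplinks)
    for offset in range(len(uplinks)):      # hash-offset probing
        candidate = uplinks[(base + offset) % len(uplinks)]
        if candidate not in failed:
            return candidate
    raise RuntimeError("all uplinks down")

links = ["spine0", "spine1", "spine2", "spine3"]
flow = b"10.0.1.5:4791->10.0.2.9:4791"     # hypothetical RoCE flow key
primary = pick_uplink(flow, links, failed=set())
rerouted = pick_uplink(flow, links, failed={primary})
```

Because the offset walk is a pure function of the flow key and the failed set, every switch that learns of the failure computes the same alternate path, which is what makes sub‑second deterministic switching possible.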

Comprehensive validation steps (hardware checks, RDMA tests, collective library benchmarks, model‑level reliability tests) ensure that only fully verified clusters are delivered.
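One way to structure such staged acceptance is a fail‑fast runner; the stage names follow the article, but the runner itself is a hypothetical sketch with stubbed checks:

```python
from typing import Callable

def run_acceptance(stages: list[tuple[str, Callable[[], bool]]]) -> bool:
    """Run ordered validation stages, stopping at the first failure so a
    cluster is only delivered once every check has passed."""
    for name, check in stages:
        ok = check()
        print(f"[{'PASS' if ok else 'FAIL'}] {name}")
        if not ok:
            return False
    return True

stages = [
    ("hardware check (NUMA/PCIe/NVSwitch/NIC)", lambda: True),  # stubbed
    ("RDMA point-to-point test",                lambda: True),
    ("collective library benchmark",            lambda: True),
    ("model-level reliability test",            lambda: True),
]
delivered = run_acceptance(stages)
```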

Looking forward, the Star Network will be offered as a public‑cloud service on Tencent Cloud, paired with the A800 HCC 1.6 T instance, and will continue to evolve in bandwidth, heterogeneous communication, custom libraries and intelligent monitoring to support ever‑larger AI models.

Tags: distributed computing, RDMA, AI training, GPU clusters, high-performance networking
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
