
Why Network Bandwidth Is the Real Bottleneck for AIGC and How DDC Solves It

This article explains why AIGC models demand massive GPU compute, why network bandwidth and latency become the critical limiting factors, and how the Distributed Disaggregated Chassis (DDC) architecture addresses these challenges with scalable, high‑throughput networking.


2023 marked a breakout year for AI, with AIGC models such as ChatGPT, GPT‑4, and Wenxin Yiyan showcasing extraordinary content‑generation capabilities.

Beyond the models themselves, the underlying communication technology is crucial; a powerful network is required to support AIGC operations, and the AI wave will transform traditional networking.

AIGC: How Much Compute Is Required?

Data, algorithms, and compute power are the three fundamentals of AI. Modern AIGC models have grown from billions to trillions of parameters, requiring tens of thousands of GPUs. For example, ChatGPT was trained on Microsoft’s super‑computing infrastructure using about 10,000 V100 GPUs, consuming roughly 3,640 PF‑days of compute.

A single V100 provides 0.014 PFLOPS; ten thousand of them deliver 140 PFLOPS, meaning ideal training would finish in about 26 days. Accounting for realistic GPU utilization (≈33 %), training extends to roughly 78 days, highlighting the impact of GPU utilization on training time and cost.
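The arithmetic above can be checked with a short back-of-the-envelope sketch. The figures (3,640 PF‑days, 10,000 V100s at 0.014 PFLOPS, ≈33 % utilization) come from the text; the helper function is just for illustration.

```python
# Back-of-the-envelope training-time estimate using the article's figures:
# 3,640 PF-days of total compute, 10,000 V100 GPUs at 0.014 PFLOPS each,
# and ~33% realized GPU utilization.

def training_days(total_pf_days: float, num_gpus: int,
                  pflops_per_gpu: float, utilization: float = 1.0) -> float:
    """Days to finish training at the cluster's effective throughput."""
    cluster_pflops = num_gpus * pflops_per_gpu          # 140 PFLOPS here
    return total_pf_days / (cluster_pflops * utilization)

ideal = training_days(3640, 10_000, 0.014)              # ~26 days
realistic = training_days(3640, 10_000, 0.014, 0.33)    # ~79 days
print(f"ideal: {ideal:.0f} days, realistic: {realistic:.0f} days")
```

The gap between the two numbers is exactly the cost of idle GPUs, which is why the network, as the main driver of utilization, matters so much.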

The biggest factor affecting GPU utilization is the network. Large GPU clusters need massive bandwidth for data exchange, and insufficient network performance forces GPUs to wait, lowering utilization and increasing training time.

What Kind of Network Can Support AIGC?

Traditional high‑performance networking solutions include:

InfiniBand – offers ultra‑high bandwidth and low latency but is expensive and vendor‑locked.

RDMA – Remote Direct Memory Access eliminates CPU involvement, boosting throughput and reducing latency; today it is commonly deployed over Ethernet using RoCE v2, together with PFC and ECN for congestion control.

Box‑style switches – used by some internet companies, but suffer from limited scalability, high power consumption, and large fault domains.
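To make the ECN mechanism mentioned above concrete, here is a toy model of DCQCN‑style sender rate adjustment as used with RoCE v2: back off multiplicatively when the switch marks packets, probe upward additively otherwise. All constants here are illustrative assumptions, not values from any vendor specification.

```python
# Toy model of ECN-driven rate control for RoCE v2 (DCQCN-style).
# Constants (alpha, increase step, line rate) are illustrative only.

def next_rate(rate_gbps: float, ecn_marked: bool,
              alpha: float = 0.5, increase_gbps: float = 5.0,
              line_rate: float = 100.0) -> float:
    """Return the sender's next rate after one feedback interval."""
    if ecn_marked:
        # Congestion signaled: multiplicative decrease before buffers overflow.
        return rate_gbps * (1 - alpha / 2)
    # No marks: additive increase back toward line rate.
    return min(rate_gbps + increase_gbps, line_rate)

rate = 100.0
for marked in [True, True, False, False, False]:
    rate = next_rate(rate, marked)
print(f"{rate:.2f} Gbps")   # two marked intervals, then three clean ones
```

The point of the sketch: ECN lets senders slow down *before* queues fill, whereas PFC pauses traffic only after buffers are nearly exhausted.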

Introducing Distributed Disaggregated Chassis (DDC)

DDC (Distributed Disaggregated Chassis) re‑architects the box‑style switch into a distributed system. The traditional backplane becomes a Network Cloud Fabric (NCF) device, while line cards become Network Cloud Packet (NCP) devices, interconnected by optical fibers. Management functions move to a Network Cloud Controller (NCC).

DDC scales flexibly: a single POD can host 96 NCPs with 400 G downlinks and 40 NCFs with 200 G uplinks, supporting 1,728 × 400 G ports and up to 216 AI servers (8 GPUs each). Multi‑POD designs expand this to thousands of ports.
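The single‑POD numbers above hang together arithmetically. A quick sanity check, assuming one 400 G NIC per GPU (the per‑NCP downlink count of 18 is inferred from 1,728 / 96; the article does not state it directly):

```python
# Sanity-check the quoted single-POD scale: 96 NCPs, 1,728 x 400G ports,
# 216 eight-GPU servers. One 400G NIC per GPU is an assumption here.

ncps = 96
downlinks_per_ncp = 1728 // ncps                 # inferred: 18 ports per NCP
total_400g_ports = ncps * downlinks_per_ncp
gpus_per_server = 8
servers = total_400g_ports // gpus_per_server    # one 400G NIC per GPU

print(total_400g_ports, servers)                 # 1728 216
```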

Technical Highlights of DDC

VOQ + Cell forwarding – packets are placed in Virtual Output Queues, credits are exchanged, and data is sliced into Cells for load‑balanced delivery, dramatically reducing packet loss and latency.

PFC single‑hop deployment – avoids deadlock by treating the entire DDC fabric as a single switch, eliminating multi‑switch contention.

ECN support – congestion notifications trigger rate‑limiting before buffers overflow.

Distributed OS – replaces a centralized NCC with a distributed control plane, improving reliability and enabling SDN‑based management via standard interfaces (Netconf, gRPC).
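The cell‑forwarding idea in the first highlight can be sketched in a few lines: an NCP slices each packet into fixed‑size cells and sprays them round‑robin across all NCF fabric links, so no single link carries a whole flow. Cell size and link count below are illustrative assumptions, not DDC specification values.

```python
# Minimal sketch of DDC-style cell spraying: slice a packet into cells
# and distribute them round-robin across NCF fabric links for load balance.
from itertools import cycle

def spray_cells(packet: bytes, cell_size: int, num_fabric_links: int):
    """Slice packet into cells; assign each cell a fabric link index."""
    cells = [packet[i:i + cell_size]
             for i in range(0, len(packet), cell_size)]
    links = cycle(range(num_fabric_links))       # round-robin over links
    return [(next(links), cell) for cell in cells]

plan = spray_cells(b"x" * 1000, cell_size=256, num_fabric_links=4)
print([(link, len(cell)) for link, cell in plan])
# 4 cells of 256+256+256+232 bytes, spread over links 0..3
```

Because every flow is spread evenly over all fabric links, hotspots and the packet loss they cause are largely avoided; the VOQ credit exchange then ensures cells are only sent when the egress side has buffer to receive them.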

Commercial Progress

Industry tests (e.g., OpenMPI All‑to‑All) show DDC‑based networks improve bandwidth utilization by ~20 % and GPU utilization by ~8 % over traditional designs. Companies such as Ruijie Networks have launched DDC products, including a 400 G NCP switch (RG‑S6930‑18QC40F1) and a 200 G NCF switch (RG‑X56‑96F1).

In summary, DDC offers superior scalability, cost‑effectiveness, and performance for AI‑driven workloads, addressing the network bottlenecks that accompany the rapid rise of AIGC and paving the way for broader digital transformation.

Tags: High Performance Computing, networking, AIGC, GPU utilization, AI infrastructure, DDC
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
