Tagged articles

13 articles

May 8, 2026 · Artificial Intelligence

How OpenAI’s MRC Protocol Redesigns Communication for 100,000‑GPU Clusters

OpenAI, together with AMD, Broadcom, Intel, Microsoft and Nvidia, introduced the Multipath Reliable Connection (MRC) protocol, which splits a single 800 Gb/s link into eight 100 Gb/s planes. This enables full‑mesh connectivity for over 100,000 GPUs with fewer switches, lower cost, and higher resilience, while dynamic load balancing eliminates the impact of congestion and hardware failures during large‑scale AI training.

AI networking · GPU clusters · MRC

12 min read

Architects' Tech Alliance

Apr 22, 2026 · Industry Insights

Why AI Supernodes and 10,000‑GPU Clusters Will Dominate 2025

The article analyzes how AI supernodes, massive GPU clusters, knowledge‑base activation, embodied intelligence, optical interconnect and open‑source agents like OpenClaw together form a complete AI industry ecosystem in 2025, highlighting performance breakthroughs, domestic competition, market share shifts, and emerging security concerns.

AI supernodes · GPU clusters · Knowledge Base

16 min read

Baidu Intelligent Cloud Tech Hub

Sep 9, 2025 · Artificial Intelligence

How Baidu Built a 32,000‑Card AI Super‑Compute Cluster and Boosted Efficiency by 50%

This article details Baidu Intelligent Cloud's journey in designing, constructing, and operating a 32,000‑card hybrid AI compute cluster, covering challenges in power, cooling, networking, multi‑cluster scheduling, and security, and explains how innovative hardware, software, and operational strategies achieved over 50% MFU improvement and industry‑first performance records.

AI infrastructure · GPU clusters · hybrid cloud

15 min read

Architects' Tech Alliance

Jul 23, 2025 · Artificial Intelligence

Why Do AI Large‑Model Training Clusters Need Specialized Network Topologies?

The article explains how AI large‑model training demands massive GPU resources and how carefully designed network architectures—such as Clos/Fat‑Tree, Spine‑Leaf, multi‑rail versus single‑rail connections, Dragonfly, and Torus—impact performance, scalability, cost, and reliability, guiding the selection of optimal data‑center networks.

AI · Data Center · GPU clusters

9 min read

Kuaishou Tech

Nov 21, 2024 · Artificial Intelligence

Best Practices for Training Large Language Models on Ultra‑Large Scale Clusters

This article summarizes the challenges of distributed training for massive language models and presents a suite of solutions—including DP/TP/PP overlap, context parallelism, efficient recomputation, and a performance‑aware cost model—that together boost training throughput by over 30% on large GPU clusters.

GPU clusters · Performance Modeling · activation rematerialization

27 min read

Architects' Tech Alliance

Sep 15, 2024 · Industry Insights

How to Build a Super‑Scale AI Cluster: From GPU Power to DPU‑Driven Architecture

This article analyzes the technical roadmap for upgrading AI super‑large GPU clusters to support trillion‑parameter multimodal models, covering single‑chip performance, super‑node scaling, DPU‑based compute fusion, energy‑efficient designs, converged storage, high‑throughput networking, and fault‑tolerant checkpoint strategies.

AI compute · DPU · GPU clusters

18 min read

Architects' Tech Alliance

Sep 8, 2024 · Artificial Intelligence

Design and Architecture of Multi‑Million GPU Clusters for Large‑Scale AI Model Training

The article surveys the network architectures and congestion‑control techniques used in massive GPU clusters, such as ByteDance’s MegaScale, Baidu HPN, Alibaba HPN7, and Tencent Xingmai 2.0, highlighting how high‑bandwidth, low‑latency designs and advanced RDMA technologies enable training of trillion‑parameter multimodal AI models.

Data Center · GPU clusters · HPN

11 min read

Architects' Tech Alliance

Jul 1, 2024 · Industry Insights

Why Fat-Tree, Dragonfly, and Torus Topologies Matter for HPC Networks

The article analyzes three major high‑performance‑computing network topologies—Fat‑Tree, Dragonfly, and Torus—detailing their design principles, scalability formulas, routing strategies, advantages, and limitations to help architects choose the most suitable architecture for large‑scale GPU clusters.

Dragonfly · Fat-Tree · GPU clusters

13 min read

Architects' Tech Alliance

May 23, 2024 · Cloud Computing

Design and Comparison of High‑Performance Cloud Data Center Networks for AI Computing

This article analyzes traditional cloud data center network limitations for AI workloads and compares various high‑bandwidth, low‑latency architectures—including two‑layer and three‑layer fat‑tree designs, InfiniBand, and RoCE—providing best‑practice recommendations for building scalable, non‑blocking AI‑Pool networks.

AI computing · Fat-Tree · GPU clusters

12 min read

Architects' Tech Alliance

Apr 6, 2024 · Artificial Intelligence

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

The article analyzes ByteDance and Peking University's MegaScale system that enables efficient, stable training of large language models on clusters exceeding ten thousand GPUs, detailing algorithmic tweaks, 3D parallel communication overlap, operator optimizations, data‑pipeline improvements, network tuning, and fault‑tolerance mechanisms that together achieve a 55.2% MFU on a 175B model.

Distributed Systems · GPU clusters · LLM training

15 min read

Baidu Intelligent Cloud Tech Hub

May 9, 2023 · Artificial Intelligence

How Baidu’s High‑Performance GPU Cluster Powers the Next Generation of Large‑Scale AI Models

This article explains how Baidu built a massive, high‑performance GPU/IB cluster, optimized its architecture and software stack, and integrated AI frameworks and resource management to overcome compute, memory, and communication bottlenecks, enabling efficient training of trillion‑parameter large models.

AI infrastructure · Cloud Computing · GPU clusters

19 min read

Tencent Cloud Developer

Mar 22, 2023 · Artificial Intelligence

Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training

Tencent’s Star Network combines a 1.6 Tbps Ethernet‑RDMA fabric, a fat‑tree topology supporting up to 4K GPUs, multi‑track traffic aggregation, adaptive heterogeneous links, and a custom TCCL library. Together these cut AllReduce overhead from 35% to 3.7% and speed AI training iterations by 32%, while automating deployment and providing sub‑second self‑healing.

AI training · Distributed computing · GPU clusters

19 min read

Baidu Geek Talk

Mar 21, 2023 · Artificial Intelligence

Infrastructure Challenges and Solutions for Large‑Scale AI Model Training

The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.

AI infrastructure · Cost Model · GPU clusters

27 min read
