How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

Baidu Intelligent Cloud Tech Hub

Mar 10, 2025

1. Evolution of AI Training Stability

In 2012, AlexNet's win in the ImageNet competition launched the modern deep learning era. A decade later, GPU clusters have grown from a few servers to thousands of cards that require dedicated power supplies. Through this explosive growth in compute, training stability management has shifted from simple operations work to precise engineering.

1.1 Early small‑model era: manual operations golden age

Before 2022, AI training resembled a handcrafted workshop. Most tasks used about a dozen GPUs with PyTorch or TensorFlow data parallelism, and engineers often preferred restarting a job to troubleshooting it. Monitoring was like a car dashboard, showing only basic task status. When a job hung, engineers inspected the logs; if GPU errors appeared, they called in ops, who used the "NVIDIA three-tool kit" (nvidia-smi, dcgm, nsys) to check temperature, power, and similar readings. This simple workflow was adequate for clusters of a few dozen cards.

1.2 The large‑model storm: quantitative to qualitative impact

ChatGPT opened a new era, and scaling to thousand‑ or ten‑thousand‑card clusters exposed the inadequacy of previous ops—like using a small net to catch a whale.

For example, in early 2024 Baidu Baige helped an AIGC startup expand from hundreds to thousands of cards. A few days in, a training job hung for hours. With no fault perception or fault tolerance in place, the hang went undetected until the next day, wasting dozens of GPU-hours. The logs showed only a timeout, and monitoring reported normal status for every node.

The engineer retried the job, which hung again, and support had to step in. Diagnosis traced the hang to a silent fault on a single node (silent data corruption, SDC). Resolution took about 30 hours and cost a large amount of compute.

2. Baidu Baige’s panoramic view of training stability

Today training stability is a core infrastructure, like seismic reinforcement in a skyscraper. As clusters move toward tens of thousands of cards, this invisible armor determines AI progress.

In 2024 Baidu Baige introduced the metric “invalid training time” and aims to minimize it. The formula is:

Invalid training time = number of fault interruptions × fault recovery duration + total checkpoint write time.

Where fault recovery duration = fault perception recall time (auto/manual) + task scheduling time + task initialization time + task recompute time.
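As a worked illustration of the formula above, the following minimal Python sketch computes invalid training time and the resulting share of effective training time. The class and function names and every sample number are assumptions chosen for illustration; they are not Baidu Baige measurements.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBreakdown:
    """Per-fault recovery cost in seconds, following the decomposition above."""
    perception_s: float      # time to perceive and recall the fault
    scheduling_s: float      # time to reschedule the task
    initialization_s: float  # time to re-initialize the task
    recompute_s: float       # time to redo work lost since the last checkpoint

    def total(self) -> float:
        return (self.perception_s + self.scheduling_s
                + self.initialization_s + self.recompute_s)


def invalid_training_time(num_faults: int,
                          recovery: RecoveryBreakdown,
                          checkpoint_write_total_s: float) -> float:
    """Invalid time = fault count x recovery duration + total checkpoint write time."""
    return num_faults * recovery.total() + checkpoint_write_total_s


def effective_ratio(wall_clock_s: float, invalid_s: float) -> float:
    """Fraction of wall-clock time spent on useful training."""
    return 1.0 - invalid_s / wall_clock_s


# Hypothetical one-week job: 10 faults, each costing 60 + 30 + 120 + 300 seconds,
# plus 1,000 seconds of checkpoint writes in total.
recovery = RecoveryBreakdown(60.0, 30.0, 120.0, 300.0)
invalid_s = invalid_training_time(10, recovery, 1000.0)
week_s = 7 * 24 * 3600
ratio = effective_ratio(week_s, invalid_s)
print(f"invalid: {invalid_s:.0f} s, effective ratio: {ratio:.4f}")
```

Because the recovery duration is multiplied by the fault count, shaving seconds off perception or recompute time pays off more as clusters grow and faults become more frequent.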

Reducing invalid time requires focusing on two dimensions: infrastructure stability and task fault tolerance, addressing three key areas:

Improve infrastructure delivery quality.

Increase fault‑tolerance recall, precision, and timeliness.

Optimize checkpoint mechanisms to cut save and recompute time.

Through fault‑tolerant architecture, Baidu Baige built an end‑to‑end automatic anomaly perception, diagnosis, and recovery stack covering task load, framework, communication, and infrastructure, achieving >90 % coverage of training anomalies, with sub‑second perception, minute‑level localization, and an average 3‑minute self‑healing time.

3. Infrastructure delivery quality assurance

Stable infrastructure is the foundation of reliability.

In the CPU era, pre‑delivery tests focused on CPU compute and network stress, without business‑level evaluation. Faults were rare and handled via ticket‑based replacement.

In the GPU era, delivery must consider CPU, GPU, RDMA, storage, power, temperature, etc. Post‑delivery, high‑load GPUs fail more often, and customers demand rapid fault detection and replacement.

Baidu Baige’s delivery process includes:

Pre‑delivery: >200 metric checks, 48‑hour burn‑in, NCCL‑Test intra‑ and inter‑node bandwidth benchmarks, and end‑to‑end large‑model training and inference performance tests.

Post‑delivery: Real‑time node fault perception, periodic inspections, and tiered self‑healing (automatic drain/restart for error‑level faults, automatic replacement for fault‑level faults).

4. Task fault tolerance precision and recall

Core to task stability is robust fault tolerance that quickly recovers from any failure.

The sequence is: detect anomalies accurately, then diagnose and locate them, and finally recover automatically.

Fault tolerance requires both explicit and implicit fault detection. Explicit faults are easy to recall via a knowledge base of error patterns combined with hardware sensing (HAS Agent), achieving >95 % recall. Implicit faults, such as silent hangs, need extensive experience to identify.

4.1 Automatic hang perception

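When a training task hangs, most frameworks eventually report a timeout error (e.g., from the NCCL watchdog). PyTorch defaults to a 10-minute timeout and Megatron-LM to 30 minutes. At massive scale, waiting that long is unacceptable, because every GPU in the job sits idle until the timeout fires.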

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802710 milliseconds before timing out.
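To make such log lines usable by an automated detector, they can be parsed into structured fields. The sketch below is a minimal example that assumes the exact message layout shown above; the layout may differ between PyTorch versions, and the function name `parse_watchdog_timeout` is hypothetical.

```python
import re
from typing import Optional

# Pattern written against the sample line above; other versions may format it differently.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<ran_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line: str) -> Optional[dict]:
    """Return the rank, sequence number, op type, and timings, or None if no match."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    return {
        "rank": int(m["rank"]),
        "seq_num": int(m["seq"]),
        "op_type": m["op"],
        "timeout_ms": int(m["timeout_ms"]),
        "ran_ms": int(m["ran_ms"]),
    }


sample = (
    "[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 1802710 milliseconds before timing out."
)
event = parse_watchdog_timeout(sample)
print(event)
```

In this sample the timeout of 1,800,000 ms is 30 minutes, so the job had already been stalled for half an hour before the framework reported anything.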

Improved perception methods include:

Log-based detection: If all workers' logs stop updating for a configured window that is well below the framework timeout, the job is likely hung.

Stack‑trace sampling: Tools like py‑spy/pystack sample stacks; unchanged stacks across workers for several minutes indicate a hang.

Metric anomalies: RDMA traffic drops to zero while most GPUs stay at 100 % utilization, and a few GPUs show 0 % utilization.

Communication‑library probes: Baidu’s BCCL timestamps each collective; missing timestamps across ranks signal a hang.

These probes feed a central master component for further diagnosis.
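The log-based check is the simplest of these probes to illustrate. The following sketch assumes a collector has already gathered the last log-update timestamp of each worker (for example from file modification times); the function name and the 300-second window are illustrative assumptions, not Baidu Baige's implementation.

```python
import time
from typing import Mapping, Optional


def all_logs_stalled(last_update_ts: Mapping[int, float],
                     stall_window_s: float,
                     now: Optional[float] = None) -> bool:
    """True if every worker's log has been silent for at least stall_window_s.

    last_update_ts maps rank -> UNIX timestamp of that worker's latest log line.
    A single worker that is still logging means the job is not considered hung.
    """
    if not last_update_ts:
        return False  # no data, so no verdict
    now = time.time() if now is None else now
    return all(now - ts >= stall_window_s for ts in last_update_ts.values())


# Hypothetical snapshot taken at t=1000 with a 300-second stall window.
snapshot = {0: 700.0, 1: 650.0, 2: 690.0}
print(all_logs_stalled(snapshot, stall_window_s=300.0, now=1000.0))
```

The window should sit well below the framework timeout, otherwise the probe offers no advantage over waiting for the watchdog.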

4.2 Automatic hang diagnosis

The master aggregates probe data and applies a time‑window (e.g., 5 minutes) rule: if at least two of log, metric, stack, or communication signals indicate inactivity, the job is deemed hung.
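The two-of-four rule can be sketched as a simple vote. The code below assumes each probe has already been reduced to a boolean "inactive during the window" flag; the signal names and the `is_hung` function are hypothetical.

```python
from typing import Mapping

# The four probe types named in the text.
SIGNALS = ("log", "metric", "stack", "communication")


def is_hung(inactive: Mapping[str, bool], min_votes: int = 2) -> bool:
    """Declare a hang when at least min_votes probes report inactivity.

    inactive maps a signal name to True if that probe saw no activity
    during the evaluation window (e.g., the last 5 minutes).
    Missing signals count as "active" so that absent data never votes for a hang.
    """
    votes = sum(1 for name in SIGNALS if inactive.get(name, False))
    return votes >= min_votes


print(is_hung({"log": True, "metric": False, "stack": True, "communication": False}))
print(is_hung({"log": True}))
```

Requiring agreement between independent signals trades a little detection latency for fewer false positives, for instance during a long checkpoint write when the logs alone go quiet.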

To locate the offending node, Baidu Baige uses:

BCCL Tracehang: Cross‑rank communication gaps identify the source rank.

RDMA/GPU metrics: Zero GPU utilization on a rank while others run at full load points to the culprit.

Stack divergence: Ranks with call stacks differing from the majority are likely hung.

Combined analysis: The signals above are cross-checked to pinpoint the root cause and correlated with hardware fault perception.
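Of these techniques, stack divergence is easy to show in a few lines. The sketch below assumes each rank's sampled call stack has been reduced to a tuple of frame names (for example from py-spy output); `divergent_ranks` is a hypothetical helper, and the frame names are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Stack = Tuple[str, ...]


def divergent_ranks(stacks: Dict[int, Stack]) -> List[int]:
    """Return ranks whose call stack differs from the strict-majority stack.

    If no stack is shared by more than half of the ranks, there is no clear
    majority to compare against, so an empty list is returned.
    """
    if not stacks:
        return []
    counts = Counter(stacks.values())
    majority_stack, majority_count = counts.most_common(1)[0]
    if majority_count * 2 <= len(stacks):
        return []
    return sorted(rank for rank, s in stacks.items() if s != majority_stack)


# Hypothetical samples: three ranks wait in a collective, one is stuck elsewhere.
waiting = ("train_step", "backward", "all_reduce")
samples = {0: waiting, 1: waiting, 2: ("train_step", "data_loader", "read"), 3: waiting}
print(divergent_ranks(samples))
```

The intuition is that healthy ranks pile up waiting in the same collective, while the rank that never reached it is the one holding everyone else back.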

4.3 eBPF‑based implicit fault perception and diagnosis

Traditional user‑space monitoring misses kernel‑level anomalies. Baidu Baige leverages eBPF probes to capture system calls, network traffic, CPU scheduling, and framework function latencies without instrumenting user code.

Four event classes are tracked:

Training‑critical function latency (forward, backward, collective ops).

Process‑schedule blocking (sched_switch, TASK_UNINTERRUPTIBLE > 5 s).

CUDA API latency via uprobes on the CUDA libraries (libcuda.so provides the driver API; the runtime API lives in libcudart.so).

RDMA verbs monitoring (ibv_post_send, ibv_poll_cq) for communication delay.

Analysis combines baseline comparison and cross‑rank consistency detection, identifying silent kernel stalls, lock contention, or NUMA‑related latency spikes.
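The analysis step can be illustrated independently of the eBPF probes themselves. The sketch below assumes per-rank latencies for one function have already been collected, and applies two checks: comparison against a historical baseline, and a robust cross-rank outlier test based on the median absolute deviation (MAD). The thresholds, the MAD method, and the function name are assumptions for illustration, not the article's stated algorithm.

```python
from statistics import median
from typing import Dict, List


def analyze_latency(latency_ms: Dict[int, float],
                    baseline_ms: float,
                    baseline_factor: float = 2.0,
                    mad_threshold: float = 5.0) -> Dict[str, List[int]]:
    """Flag ranks that are slow versus a baseline and versus their peers."""
    if not latency_ms:
        return {"above_baseline": [], "cross_rank_outliers": []}
    values = list(latency_ms.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # Floor the scale at 1% of the median so tiny jitter is not flagged when MAD is ~0.
    scale = max(mad, 0.01 * med, 1e-9)
    above = sorted(r for r, v in latency_ms.items()
                   if v > baseline_factor * baseline_ms)
    outliers = sorted(r for r, v in latency_ms.items()
                      if v - med > mad_threshold * scale)
    return {"above_baseline": above, "cross_rank_outliers": outliers}


# Hypothetical all-reduce latencies (ms) per rank, with a 10 ms historical baseline.
report = analyze_latency({0: 10.0, 1: 10.2, 2: 9.9, 3: 55.0}, baseline_ms=10.0)
print(report)
```

Keeping the two checks separate helps distinguish a single slow rank (a cross-rank outlier) from a cluster-wide slowdown, where every rank exceeds the baseline but none stands out from its peers.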

Using these methods, Baidu Baige reduced implicit‑fault detection time from minutes to seconds and improved diagnosis accuracy by over 40 %.

5. Ensuring timely task fault recovery

Recovery speed reflects how quickly a task returns to training after a fault, minimizing wasted compute. Two factors matter: average interruption time and recompute time.

5.1 Multi-level restart strategies to cut interruption time

Multi-level restart strategies reduce interruption time according to the type of fault:

Explicit single‑node faults: replace the node and mask it at the cluster level.

Implicit single‑node faults: replace the node and mask it at the task level.

Multi‑node faults: attempt in‑place restart; if unsuccessful, resubmit the whole job.

These strategies have cut typical recovery from 30 minutes to under 30 seconds with >95 % success.
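The decision logic behind these three cases can be sketched as a small planner. Everything here is illustrative: the action strings, `FaultKind`, and `plan_recovery` are hypothetical names, and the in-place restart is represented by a callback that reports success or failure.

```python
from enum import Enum
from typing import Callable, List


class FaultKind(Enum):
    EXPLICIT = "explicit"  # matched a known error pattern or hardware signal
    IMPLICIT = "implicit"  # inferred, e.g. from a silent hang


def plan_recovery(faulty_nodes: List[str],
                  kind: FaultKind,
                  try_in_place_restart: Callable[[], bool]) -> List[str]:
    """Return the ordered recovery actions for a fault, per the rules above."""
    if not faulty_nodes:
        return []
    if len(faulty_nodes) == 1:
        node = faulty_nodes[0]
        # Explicit faults are masked cluster-wide; implicit ones only for this task.
        scope = "cluster" if kind is FaultKind.EXPLICIT else "task"
        return [f"replace:{node}", f"mask:{node}:{scope}"]
    # Multi-node fault: try restarting in place before resubmitting the job.
    actions = ["in_place_restart"]
    if not try_in_place_restart():
        actions.append("resubmit_job")
    return actions


print(plan_recovery(["node-17"], FaultKind.IMPLICIT, lambda: True))
print(plan_recovery(["node-3", "node-9"], FaultKind.EXPLICIT, lambda: False))
```

Masking an implicitly faulty node only at the task level is the conservative choice: the evidence is circumstantial, so the node stays available to other jobs until hardware diagnosis confirms a defect.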

5.2 Trigger‑based checkpointing to cut recompute time

Traditional fixed-interval checkpoints waste storage when saved too often and lose all progress since the last save when a fault strikes between intervals. Trigger-based checkpoints instead save state on specific events (faults, OOM, etc.), balancing storage cost against recovery speed. "Zero-repeat" checkpoints (saved every step) eliminate recompute entirely but are storage-heavy.

Key techniques for efficient trigger‑based checkpoints:

Integrate fault perception to auto‑trigger saves before process exit.

Asynchronous dump to shared memory, followed by fast transfer over RDMA to the replacement nodes.

Periodic redundant backups to guard against catastrophic node crashes.

Combined with incremental async checkpoints, this approach improves safety while reducing recompute overhead.
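The core idea of a trigger-based checkpoint can be sketched with a flag that the fault-perception layer sets and the training loop checks. This is a simplified, single-process illustration: the in-memory copy stands in for the shared-memory dump, the simulated fault replaces a real signal, and the names `TriggerCheckpointer` and `train` are hypothetical.

```python
import copy
import threading
from typing import Dict, Optional


class TriggerCheckpointer:
    """Saves training state only when a trigger (e.g., a detected fault) fires."""

    def __init__(self) -> None:
        self._trigger = threading.Event()
        self.saved: Optional[Dict] = None  # stand-in for a shared-memory dump

    def request_save(self) -> None:
        """Called by the fault-perception component when a fault is detected."""
        self._trigger.set()

    def maybe_save(self, step: int, state: Dict) -> bool:
        """Snapshot the state if a save was requested; return True if saved."""
        if not self._trigger.is_set():
            return False
        self.saved = {"step": step, "state": copy.deepcopy(state)}
        self._trigger.clear()
        return True


def train(num_steps: int, ckpt: TriggerCheckpointer,
          fault_at_step: Optional[int] = None) -> int:
    """Toy loop: returns the step at which training stopped."""
    state = {"weights": 0.0}
    for step in range(1, num_steps + 1):
        state["weights"] += 1.0      # stand-in for one optimizer step
        if step == fault_at_step:
            ckpt.request_save()      # simulated fault signal
        if ckpt.maybe_save(step, state):
            return step              # exit after saving; a new node resumes here
    return num_steps


ckpt = TriggerCheckpointer()
stopped_at = train(10, ckpt, fault_at_step=7)
print(stopped_at, ckpt.saved)
```

Because the snapshot is taken at the step where the fault was perceived, a replacement node can resume from that step without recomputing anything. This only works if the failing process survives long enough to save, which is why the periodic redundant backups in the list above remain necessary.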

6. Business demands for stability

AI training stability has become a precision engineering discipline. From manual restarts to automated perception and rapid recovery, each advance mirrors the exponential growth of compute.

Looking ahead to hundred‑thousand‑card clusters, future solutions may combine eagle‑sharp fault detection with goose‑flock‑style resource scheduling, balancing sub‑second recovery with petabyte‑scale storage costs.

Currently Baidu Baige achieves 99.5 % effective training time on thousand‑ and ten‑thousand‑card clusters, supporting flagship models such as the domestic mathematics model “Jiuzhang” and the Sora‑like model “Vidu”.
