Model Training Optimization

Baidu Geek Talk
May 10, 2023 · Artificial Intelligence

Baidu's AI Infrastructure for Large-Scale LLM Training: Architecture, Challenges, and Optimization

Baidu’s AI infrastructure combines a large InfiniBand‑linked GPU cluster, Kunlun chips, the PaddlePaddle framework, and the Wenxin model suite. It uses 4D hybrid parallelism, elastic fault tolerance, and a two‑stage training pipeline to overcome the computation, memory, and communication walls, delivering world‑leading MLPerf performance for large‑scale LLM training.

AI infrastructure · GPU Cluster · InfiniBand
15 min read