Baidu Intelligent Cloud Tech Hub
Author

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.

133 articles · 0 likes · 189 views · 0 comments
Recent Articles

Latest from Baidu Intelligent Cloud Tech Hub

100 recent articles max
Baidu Intelligent Cloud Tech Hub
May 31, 2024 · Artificial Intelligence

How Multi‑Chip Heterogeneous Clusters Power Next‑Gen Large Model Training

Using a martial‑arts analogy, the article explains why training massive AI models now requires clusters of thousands of GPUs or of mixed chips. It outlines three key steps (interconnect, distributed parallel strategies, and accelerator‑level optimization) and shows how Baidu’s Baige platform achieves near‑full efficiency across GPU, Kunlun, and Ascend chips.

AI training · GPU interconnect · accelerator optimization
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
May 27, 2024 · Databases

Baidu’s Enterprise Vector Database: Architecture, Performance, and RAG Secrets

An exclusive interview with Baidu’s senior database architects reveals the motivations behind building a dedicated enterprise vector database, details its novel column‑store engine, C++‑based retrieval stack, performance gains over open‑source solutions, multi‑modal support, RAG integration, and future research directions.

AI · RAG · Storage Engine
0 likes · 28 min read
Baidu Intelligent Cloud Tech Hub
May 15, 2024 · Artificial Intelligence

How Baidu’s AIAK‑LLM Supercharges Large‑Model Training and Inference

The article explains the scaling challenges of ever‑larger LLMs, introduces the Model FLOPs Utilization (MFU) performance metric, and surveys industry parallelism and memory‑saving techniques. It then details Baidu’s AIAK‑LLM suite, including its resource, component, and acceleration layers, as well as concrete training and inference optimizations that raise MFU by 30‑60% and cut deployment costs.

AI infrastructure · MFU · Memory Optimization
0 likes · 25 min read
Baidu Intelligent Cloud Tech Hub
Apr 24, 2024 · Artificial Intelligence

How to Build and Accelerate Multi‑Chip AI Clusters for Large‑Model Training

With AI training demands outgrowing single‑chip GPU clusters, this article explains how to construct and speed up heterogeneous AI clusters—combining GPUs, Kunlun, and Ascend chips—by addressing interconnect, distributed parallel strategies, and specialized acceleration suites to achieve high MFU and efficient large‑model training.

AI clustering · GPU Acceleration · MFU
0 likes · 15 min read
Baidu Intelligent Cloud Tech Hub
Apr 16, 2024 · Operations

Tackling Multi-CPU Performance Challenges with Baidu’s One-Click Btune

At QCon 2024, Baidu Intelligent Cloud presented the complexities of optimizing diverse CPU architectures in data centers and introduced Btune, a one‑click solution that automates bottleneck detection, analysis, and performance tuning across Intel, AMD, and ARM platforms, enabling engineers to boost service efficiency.

Btune · CPU performance · Cloud Computing
0 likes · 18 min read
Baidu Intelligent Cloud Tech Hub
Mar 1, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis

Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training by improving real‑time bandwidth monitoring, fault diagnosis, network stability, and performance, leveraging RDMA networks and GPU‑specific optimizations to increase effective training time to 98% and bandwidth utilization to 95%.

AI infrastructure · Fault Diagnosis · Observability
0 likes · 11 min read