
Challenges and Solutions for AI Storage Systems in Cloud‑Native Training

The talk outlines how AI training’s growing data and compute demands create storage bottlenecks across four evolutionary stages. It identifies four core problems (massive data, data flow, resource scheduling, and compute acceleration) and proposes hardware, software (parallel file systems, caching), and cloud‑native orchestration (Fluid, Baidu Canghai) solutions that pair an object‑storage data lake with a high‑performance acceleration layer to achieve near‑full GPU utilization.

Baidu Geek Talk

AI applications put comprehensive pressure on storage systems, from accelerating data access near compute to managing massive data lakes and coordinating resource scheduling. Inefficiency at any stage can delay AI tasks and cause contention among concurrent AI workloads.

This talk examines the entire AI training workflow from a storage perspective, outlines the evolution of enterprise AI training infrastructure in four stages, and highlights the recurring storage problems that accumulate as scale grows.

Stage 1: Small models and datasets run on single machines; storage is local memory or disks.

Stage 2: Model and data size outgrow a single node, prompting multi‑node training and the adoption of commercial network storage.

Stage 3: Scale increases further, leading to training platforms that separate compute and storage. Enterprises face a mix of high‑capacity, low‑cost storage for bulk data and high‑performance storage for hot training data.

Stage 4: In the cloud‑native era, the same “large capacity + high performance” combination persists, but the data flow reverses: the data lake becomes the central source, and data must be moved to an acceleration layer before training.

The cloud‑native AI training stack consists of a data‑lake storage layer (large capacity, high throughput, low cost) and an acceleration layer that supplements the lake for high‑performance compute; diagrams in the talk illustrate this architecture.

The analysis identifies four key problems in AI training:

① Massive data – choosing a data lake that can reliably store petabytes of data.

② Data flow – moving data from the lake to a fast acceleration layer.

③ Resource scheduling – coordinating storage resources with the training scheduler.

④ Compute acceleration – ensuring storage does not become the bottleneck.

The presentation then works through these problems in reverse order, starting from compute acceleration and drilling down to data flow, showing how each can be tackled.

Compute acceleration is broken down into three I/O patterns: shuffle, batch reads, and checkpointing. Shuffle is a pure wait, since the GPU idles until the reshuffled data arrives; batch reads can be overlapped with GPU computation via DataLoader pipelines, reducing read‑wait time to near zero; checkpointing is a sequential write that usually has only minor impact.
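The overlap idea behind DataLoader pipelines can be sketched with a minimal prefetch loop: a background thread keeps reading batches from storage into a bounded buffer while the training step runs, so read latency hides behind compute. This is an illustrative stdlib sketch, not the actual framework code; the function names and timings are made up.

```python
import queue
import threading
import time

def loader(batch_ids, out_q):
    """Background thread: stand-in for reading batches from storage."""
    for i in batch_ids:
        time.sleep(0.01)            # simulated storage read
        out_q.put(f"batch-{i}")
    out_q.put(None)                 # sentinel: no more batches

def train(num_batches=8, prefetch=4):
    q = queue.Queue(maxsize=prefetch)   # bounded prefetch buffer
    threading.Thread(target=loader, args=(range(num_batches), q),
                     daemon=True).start()
    done = []
    while (batch := q.get()) is not None:
        time.sleep(0.01)            # simulated GPU step; the loader keeps reading meanwhile
        done.append(batch)
    return done

print(train())
```

Real DataLoader implementations generalize this pattern to multiple worker processes and pinned memory, but the effect is the same: as long as the buffer stays non-empty, the compute side never waits on storage.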

Key observations:

Metadata operations (open, stat, close) dominate when training on millions of small files.

Optimizations include packing small files into TFRecord/HDF5, maintaining a file‑list for shuffle, and using parallel file systems to reduce metadata latency.
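The packing and file‑list ideas can be illustrated in a few lines: many small records go into one blob with an (offset, length) index, so per‑sample open/stat/close calls become a single open plus seeks, and shuffling operates on the cheap in‑memory index rather than the filesystem. This is a toy sketch of the principle, not the TFRecord/HDF5 formats themselves; all names are illustrative.

```python
import io
import random

def pack(samples):
    """Pack many small records into one blob plus an (offset, length) index."""
    blob, index = io.BytesIO(), []
    for s in samples:
        index.append((blob.tell(), len(s)))
        blob.write(s)
    return blob.getvalue(), index

def shuffled_reader(blob, index, seed=0):
    """Shuffle the in-memory index (a data op) instead of opening files (metadata ops)."""
    order = list(range(len(index)))
    random.Random(seed).shuffle(order)
    for i in order:
        off, length = index[i]
        yield blob[off:off + length]

samples = [f"sample-{i}".encode() for i in range(5)]
blob, index = pack(samples)
print(list(shuffled_reader(blob, index)))
```

This is exactly the "convert metadata ops to data ops" direction discussed next: the expensive part of small‑file training moves from the metadata server into sequential reads of a large object.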

Three optimization directions are proposed:

Reduce the problem’s difficulty (e.g., convert metadata ops to data ops by using list files or packed formats).

Strengthen the hardware side (more memory, faster SSDs, 100/200 Gbps networking).

Bring storage closer to compute (GPU Direct Storage, caching on local node memory/disk).

Software solutions focus on parallel file systems (e.g., Lustre, BeeGFS, GPFS, Baidu PFS) and caching systems (Alluxio, JuiceFS, RapidFS). Parallel file systems sacrifice standard protocols for private, kernel‑level clients to achieve minimal I/O path latency, while caching systems provide near‑compute data placement for read‑only AI workloads.

Resource scheduling is addressed by the Fluid framework, which extracts dataset preparation as a separate step, allocates resources via Kubernetes, pre‑warms metadata and data, and optionally enforces affinity between cache and compute nodes to improve performance and fault tolerance.
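The workflow above maps onto Fluid's custom resources: a Dataset declares where the data lives, a runtime allocates cache resources near compute, and a DataLoad pre‑warms data before training. A minimal sketch, assuming an Alluxio‑backed runtime (the names, bucket URL, and sizes are all illustrative, not from the talk):

```yaml
# Dataset: declares the lake location (hypothetical bucket URL)
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: train-data
spec:
  mounts:
    - mountPoint: s3://example-bucket/train
      name: train
---
# Runtime: provisions cache capacity near the compute nodes
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: train-data
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
---
# DataLoad: pre-warms metadata and data before training starts
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: train-data-warmup
spec:
  dataset:
    name: train-data
    namespace: default
```

Because the Dataset and runtime are ordinary Kubernetes objects, the training scheduler can place pods with affinity to the nodes holding the warmed cache.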

Comparing candidate storage for the lake layer shows that object storage offers lower cost, better scalability (flat namespace, EB scale), higher availability, and superior throughput for massive data, making it the preferred choice.

The final Baidu “Canghai” storage solution combines BOS object storage (data lake), a high‑performance acceleration layer, Parallel File System (PFS), and RapidFS cache. Bucket Link synchronizes data between the lake and acceleration layer, and Fluid integrates scheduling. The solution addresses all four problems:

Massive data – BOS and its ecosystem.

Data flow – Bucket Link automatic sync.

Resource scheduling – Fluid‑driven PFS/RapidFS.

Compute acceleration – PFS/RapidFS provide low‑latency access.

Experimental results demonstrate that using RapidFS or PFS with pre‑warming achieves near‑100 % GPU utilization, whereas direct object‑storage training suffers significant I/O stalls.

Q&A sections cover topics such as why object storage enables compute‑storage separation, criteria for storage selection, bridging on‑prem and cloud storage, differences between Ceph and HDFS, and when to use PFS versus RapidFS.

For the full video replay and additional reading, see the links provided at the end of the talk.

Tags: cloud native, performance optimization, AI, caching, storage, data lake, parallel file system