
Building a Deep Learning Training Platform on Cloud: Challenges, Runonce Service, and Storage Optimization

iQIYI built a cloud‑based deep‑learning training platform, Jarvis, to replace its initial Runonce service. Jarvis containerizes GPU tasks, adopts Ceph S3 storage mounted via FUSE, and optimizes data pipelines, addressing compute, storage, and networking challenges to improve scalability and reduce GPU idle time.

iQIYI Technical Product Team

Deep learning has achieved great success in image processing, replacing many traditional algorithms. At iQIYI, the rapid growth of GPU‑intensive training tasks creates high cost pressure and demands efficient GPU provisioning. To lower the barrier for deep learning usage, the team embarked on building a cloud‑based deep learning platform.

The cloud platform addresses three core problems: compute, storage, and networking. For compute, iQIYI heavily uses virtualized and containerized resources. Containers are preferred because they provide near‑lossless GPU access without extra hardware, enable fast environment provisioning, and have short startup times.

Training Task Scenario

The first attempt focused on containerizing training tasks. A typical task reads data from a source, runs a framework API, and outputs checkpoints, models, logs, and events. A demo container was quickly built and presented to algorithm engineers, who asked whether they could use the service without understanding Docker.
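The shape of such a task can be sketched as a loop that consumes input data and emits checkpoints and logs. This is an illustrative stand‑in, not iQIYI's actual code: the file names, step counts, and the trivial "loss" update are all placeholders for a real framework call.

```python
import json
import os

def run_training_task(data_dir, output_dir, steps=100, ckpt_every=50):
    """Minimal sketch of a containerized training task: read data from
    data_dir, run training steps, and write checkpoints plus logs to
    output_dir. data_dir is unused here; a real task would stream from it."""
    os.makedirs(output_dir, exist_ok=True)
    loss = 1.0
    with open(os.path.join(output_dir, "train.log"), "w") as log:
        for step in range(1, steps + 1):
            loss *= 0.99  # stand-in for one real framework training step
            log.write(f"step={step} loss={loss:.4f}\n")
            if step % ckpt_every == 0:
                # Periodic checkpoint, as the outputs described above
                ckpt = os.path.join(output_dir, f"ckpt-{step}.json")
                with open(ckpt, "w") as f:
                    json.dump({"step": step, "loss": loss}, f)
    return loss
```

Everything the task produces lands under a single output directory, which is what makes the later "sync outputs to shared storage" step tractable.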

Runonce Training Service

Runonce was designed to let users treat containers like virtual machines. It runs on a Mesos‑based container cloud, uses Ceph RBD as the root filesystem, and provides an sshd entry point with a tool for injecting SSH keys. While the service was simple to develop and easy to use, it had several drawbacks: users could modify the system environment, producing hard‑to‑track errors; actual GPU utilization was low because much of the session time went to shell‑based debugging; and there was no shared storage, no parallel task execution, and no distributed‑training support.

Because these issues could not be fixed within Runonce's design, a new platform (named Jarvis) was planned. Observations from Runonce highlighted the typical workflow: dataset upload → write training code → execute task → retrieve results. The storage layer became the critical component.

Storage Choice

For data sharing, network storage is preferred. Design criteria include robustness, scalability, throughput, concurrency, and latency. Jarvis selected Ceph object storage (via S3) as the primary backend because of its high concurrency and horizontal scalability, despite requiring a custom file‑API wrapper.
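The "custom file‑API wrapper" can be as simple as a class that presents read/write calls on top of whole‑object get/put operations. The sketch below is hypothetical: the `get_object`/`put_object` client interface is assumed, and a real deployment would use an S3 SDK such as boto3 against the Ceph RADOS Gateway.

```python
class S3File:
    """File-style wrapper over an object-store client. The client is
    assumed (hypothetically) to expose get_object(bucket, key) -> bytes
    and put_object(bucket, key, data)."""

    def __init__(self, client, bucket, key):
        self.client = client
        self.bucket = bucket
        self.key = key
        self._buf = bytearray()

    def read(self):
        # Object storage serves whole objects, so a read fetches everything.
        return self.client.get_object(self.bucket, self.key)

    def write(self, data):
        # Writes are buffered locally: S3 only accepts whole-object uploads.
        self._buf.extend(data)

    def close(self):
        # Flush the buffered contents as one object upload.
        self.client.put_object(self.bucket, self.key, bytes(self._buf))
```

Buffering writes locally and uploading on close mirrors the whole‑file‑upload constraint that shapes the rest of Jarvis's storage design.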

Jarvis Storage

Jarvis mounts Ceph S3 using a FUSE filesystem. While reads are efficient, writes are limited (no random writes, whole‑file uploads required). Jarvis caches data locally and synchronizes to Ceph, using short sync intervals for small log files and longer intervals for large model files.
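The size‑based sync policy can be sketched as a small decision function. The threshold and intervals below are assumptions for illustration, not iQIYI's actual values.

```python
import os
import time

SMALL_FILE_LIMIT = 1 << 20   # 1 MiB: logs and event files (assumed cutoff)
FAST_SYNC_S = 5              # small files sync often, so logs stay fresh
SLOW_SYNC_S = 300            # large model files sync rarely, saving uploads

def due_for_sync(path, last_synced, now=None):
    """Return True when a locally cached file should be pushed to Ceph.
    Small files (logs) use the short interval; large files (models) the long one."""
    now = time.time() if now is None else now
    small = os.path.getsize(path) < SMALL_FILE_LIMIT
    interval = FAST_SYNC_S if small else SLOW_SYNC_S
    return now - last_synced >= interval
```

A background loop would call this per cached file and re‑upload whole files when due, which fits the no‑random‑write constraint of the S3 backend.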

Jarvis Runtime Environment

Training containers use a custom image with a precisely defined software stack. The image startup script mounts S3 via FUSE, pulls code from GitLab, runs the user’s training job, and syncs outputs back to object storage. To accelerate development, the environment is split into a base image and a runtime script image, allowing script updates without rebuilding the base.
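The entrypoint sequence can be sketched as an ordered list of commands. All of the command names, URLs, and paths below are placeholders; the real startup script's tooling is not specified in the source.

```python
import subprocess

def startup(run=subprocess.check_call):
    """Sketch of a Jarvis-style container entrypoint: mount object storage,
    fetch code, run training, then push results back. The `run` callable is
    injectable so the sequence can be tested without executing commands."""
    steps = [
        # 1. Mount Ceph S3 through FUSE (tool name is a placeholder)
        ["s3fuse", "mount", "s3://jarvis-bucket", "/mnt/jarvis"],
        # 2. Pull the user's training code from GitLab (placeholder URL)
        ["git", "clone", "https://gitlab.example.com/user/train.git", "/workspace"],
        # 3. Run the training job, writing outputs to the mounted storage
        ["python", "/workspace/train.py", "--output", "/mnt/jarvis/output"],
        # 4. Final sync of outputs back to object storage (placeholder tool)
        ["sync-tool", "push", "/mnt/jarvis/output"],
    ]
    for cmd in steps:
        run(cmd)
    return steps
```

Keeping this sequence in a thin script layered over a stable base image is what allows the script to change without rebuilding the base, as described above.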

Despite added features such as real‑time log web pages and TensorBoard integration, GPU utilization did not improve significantly. Investigation revealed that most tasks did not optimize data loading, which becomes more critical in a network‑storage setting.

Storage Optimization

Network storage offers high throughput and concurrency, but also higher latency and occasional tail‑latency spikes. To mitigate these, increase data‑reading concurrency, avoid many small files, and relax strict read ordering. For image datasets, combine small files into large TFRecord files to reduce the request count. Provision sufficient CPU for data preprocessing, since insufficient CPU leaves the GPU idle. Converting raw text inputs to binary formats further reduces CPU overhead and helps fully utilize GPU compute.
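The pack‑small‑files‑into‑big‑files idea can be shown without TensorFlow using a simple length‑prefixed record format. This is an illustrative stand‑in for TFRecord, not its actual on‑disk layout.

```python
import struct

def pack_records(paths, out_path):
    """Pack many small files into one record file, trading many storage
    requests for a single large sequential read (the TFRecord idea)."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                data = f.read()
            out.write(struct.pack("<Q", len(data)))  # 8-byte length header
            out.write(data)

def read_records(path):
    """Stream the packed records back in order with one sequential pass."""
    with open(path, "rb") as f:
        while header := f.read(8):
            (length,) = struct.unpack("<Q", header)
            yield f.read(length)
```

Reading one packed file sequentially avoids issuing a network request per image, which is exactly where the latency and tail effects of object storage hurt most.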

Overall, deep learning platforms continue to evolve (e.g., Kubeflow, OpenPAI) toward end‑to‑end solutions that let algorithm engineers focus on data and models while the platform handles the rest.

Tags: deep learning, Containerization, Storage Optimization, Cloud Platform, GPU computing, AI training