
Atlas Supercomputing Platform: Architecture, Alluxio‑Fluid Integration, and Performance Improvements for AI Workloads

This article presents CloudKnow's Atlas supercomputing platform: its AI‑focused architecture, the storage and bandwidth bottlenecks of the early design, the integration of Alluxio and Fluid for distributed caching, the business adaptations that integration required, and experimental results showing significant performance gains across speech‑denoising, image‑classification, large‑file, and speech‑recognition workloads.

DataFunTalk

Guide: CloudKnow's Atlas supercomputing platform serves as the underlying infrastructure for AI model training and inference, providing high‑performance compute and massive data storage capabilities. It supports mainstream machine‑learning frameworks and enables efficient development of speech, language, big‑data, and multimodal technologies.

Background and Early Issues: Atlas is the core AI research platform at CloudKnow, built on thousands of GPU cards (A100, V100, RTX6000) and a Lustre‑based distributed file system. Early problems included storage bandwidth bottlenecks, IO contention on shared GPU nodes, a flood of small files (WAV, JPG) stressing metadata services, and data redundancy across multiple storage directories.

Early Solutions Attempted:

Limit storage bandwidth per node and enforce per‑user IOPS caps.

Restrict the number of small files and encourage aggregation into LMDB or TFRecord formats.

Re‑design the scheduler to prioritize idle nodes and avoid intra‑node competition.

Introduce multi‑level caching, though the initial implementation lacked automation and metadata management.

These measures did not fully resolve the challenges.
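The small‑file recommendation above rests on one idea: pack many tiny samples into a single large container file so the metadata service handles one inode instead of millions. The snippet below is a minimal, stdlib‑only sketch of that idea, using a flat data file plus an offset index as a stand‑in for the LMDB and TFRecord formats the platform actually recommends; all paths and names are illustrative.

```python
import json
import os


def pack(sample_paths, data_path, index_path):
    """Concatenate many small files into one large file plus an offset index."""
    index = {}
    with open(data_path, "wb") as out:
        for path in sample_paths:
            with open(path, "rb") as f:
                blob = f.read()
            # Record where this sample starts and how long it is.
            index[os.path.basename(path)] = (out.tell(), len(blob))
            out.write(blob)
    with open(index_path, "w") as f:
        json.dump(index, f)


def read_sample(name, data_path, index_path):
    """Fetch one sample with a single seek+read instead of a per-file open."""
    with open(index_path) as f:
        offset, length = json.load(f)[name]
    with open(data_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

With thousands of WAV or JPG samples packed this way, the storage system sees a handful of large sequential reads rather than a flood of small random ones, which is exactly the access pattern Lustre and the metadata service prefer.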

Alluxio + Fluid Integration: In 2019 the Atlas team evaluated Alluxio and later adopted Fluid as a lightweight caching layer. Alluxio provides POSIX‑compatible interfaces that match TensorFlow and PyTorch workloads, while Fluid offers observable, controllable caching with Kubernetes‑native deployment. The combined stack replaces heavyweight Docker‑based Alluxio deployments with a more flexible, low‑overhead solution.
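The POSIX compatibility is what makes the stack transparent to TensorFlow and PyTorch: training code opens cached files with ordinary file I/O and never talks to Alluxio directly. A minimal sketch of a PyTorch‑style map dataset reading from such a mount follows; the class only mimics the `__len__`/`__getitem__` protocol (no torch import), and the mount path in the comment is a hypothetical example, not a path from the article.

```python
import os


class CachedFileDataset:
    """Map-style dataset that reads samples straight from a POSIX mount.

    Under Alluxio+Fluid the root would be a FUSE mount point such as
    /mnt/fluid/dataset (hypothetical path); here any directory works,
    which is the point: the cache layer is invisible to training code.
    """

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(root, name) for name in os.listdir(root)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # A plain POSIX open/read; Alluxio serves it from cache when warm.
        with open(self.paths[i], "rb") as f:
            return f.read()
```

Because nothing here is Alluxio‑specific, the same dataset class runs unmodified against local disk, Lustre, or the cache mount, which is what lets teams adopt the caching layer without touching model code.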

Business Adaptations:

Non‑root & host‑path support: Fluid now mounts caches via hostPath, and an init container injects user UID/GID so that training pods access storage with correct permissions.

Unified resource view: Alluxio aggregates multiple distributed file systems, presenting a single namespace for users and allowing Fluid to mount them under a common directory.

Data pre‑warming: Alluxio's distributed load feature pre‑loads TB‑scale small‑file datasets into the cache before training starts, so jobs no longer stall for hours on cold first reads.

Automated submission tool: The atlasctl cache create command lets users specify cache type, size, and target nodes, after which Fluid schedules the cache and the task scheduler automatically selects cached nodes for training.
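The final step in the list above, the task scheduler steering jobs onto nodes that already hold the cache, can be sketched as a simple placement filter: prefer cached nodes, fall back to whichever nodes have the most idle GPUs. Everything here (node names, the cache map, the function itself) is a hypothetical illustration, not the real Atlas scheduler, which integrates with Kubernetes.

```python
def pick_nodes(job_dataset, requested, cache_map, free_gpus):
    """Prefer nodes whose local cache workers already hold the job's dataset.

    cache_map: dataset name -> set of node names caching it
    free_gpus: node name -> number of idle GPUs
    Returns up to `requested` GPUs as (node, gpus) assignments,
    cached nodes first.
    """
    cached = cache_map.get(job_dataset, set())
    # Rank: cached nodes before uncached ones, then by idle GPUs descending.
    ranked = sorted(free_gpus, key=lambda n: (n not in cached, -free_gpus[n]))
    plan, remaining = [], requested
    for node in ranked:
        if remaining <= 0:
            break
        take = min(free_gpus[node], remaining)
        if take > 0:
            plan.append((node, take))
            remaining -= take
    return plan
```

The design choice this sketch illustrates is locality over pure load balancing: placing a job on a node with a warm cache trades a slightly less even GPU spread for reads that never leave the machine.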

Experimental Results:

Scenario 1 – Speech Denoising (small files): Using 10 RTX6000 GPUs with full‑memory caching achieved a 10× speedup, higher GPU utilization, and near‑zero storage bandwidth consumption.

Scenario 2 – Image Classification (medium files): On a ResNet‑50 ImageNet TFRecord workload, Alluxio caching delivered a 2.5× throughput increase compared with direct Lustre access.

Scenario 3 – Large File (125 GB LMDB): Alluxio with data pre‑warming reduced bandwidth usage from >1 GB/s to near zero and accelerated training by ~30×.

Scenario 4 – Speech Recognition (DDP + Alluxio): Distributed training on 2 machines × 20 GPUs saw a 10 % reduction in total training time (20 min → 18 min) after cache optimization.

Overall Benefits: The Alluxio‑Fluid stack improves model production efficiency (up to 10× for small‑file workloads), dramatically lowers storage bandwidth load, increases GPU utilization, and provides observable cache management.

Future Work: The team plans to collaborate with algorithm groups to refine tuning parameters, expand SSD‑based cache capacity, and further enhance Fluid’s disk‑scheduling capabilities.

Thank you for attending the session.
