
Multi-Cloud Unified Data Acceleration Layer at Xiaohongshu: Challenges, Alluxio Solution, and Performance Gains

This article presents Xiaohongshu's multi‑cloud unified data acceleration layer built with Alluxio, detailing the challenges of multi‑cloud architectures, the design goals, Alluxio's architecture and features, real‑world case studies in AI training and recommendation indexing, performance improvements, and future plans.


Xiaohongshu, a popular lifestyle content‑sharing and social commerce platform, faces significant challenges in its multi‑cloud architecture: complex cross‑zone communication, high latency, scarce and expensive dedicated lines, and inefficient utilization of its massive CPU/GPU fleet for AI training and recommendation services.

The main problems identified were low utilization of large‑scale CPU/GPU resources in machine‑learning training, slow index distribution in recommendation services, metadata handling for massive numbers of small files, and the high cost and instability caused by extensive cross‑cloud data transfers.

To address these issues, Xiaohongshu built a multi‑cloud unified data acceleration layer using Alluxio, which provides a transparent caching layer that can reuse existing data without migration, supports both S3 and POSIX protocols, controls cross‑cloud bandwidth, and scales to billions of AI training metadata entries.
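The dual‑protocol access described above means the same logical dataset can be reached either through a POSIX path (via a FUSE mount) or through an S3‑style URL (via Alluxio's S3 proxy). The sketch below illustrates the path mapping only; the mount point and proxy endpoint names are illustrative assumptions, not Xiaohongshu's actual configuration.

```python
# Sketch: one logical Alluxio path exposed through two protocols.
# FUSE_MOUNT and S3_PROXY are hypothetical names for illustration.

FUSE_MOUNT = "/mnt/alluxio-fuse"           # assumed POSIX (FUSE) mount point
S3_PROXY = "http://alluxio-proxy:39999"    # assumed S3-compatible endpoint

def posix_path(alluxio_path: str) -> str:
    """POSIX view of an Alluxio path, as seen under the FUSE mount."""
    return FUSE_MOUNT + alluxio_path

def s3_url(alluxio_path: str) -> str:
    """S3-proxy view of the same path: first component acts as the
    bucket, the remainder as the object key."""
    bucket, _, key = alluxio_path.lstrip("/").partition("/")
    return f"{S3_PROXY}/{bucket}/{key}"

print(posix_path("/training/samples/part-0001"))
print(s3_url("/training/samples/part-0001"))
```

A training job can therefore `open()` files through the POSIX view while an existing S3‑based pipeline reads the identical data through the proxy, with no migration of the underlying storage.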

Alluxio’s architecture consists of a Master for metadata, Workers for data caching and reads, Job Master and Job Workers for asynchronous tasks, and an Alluxio Client that fetches data from underlying storage when not cached.
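The read path above can be sketched as a small simulation: the client consults the Master's metadata to locate a Worker, a cache hit is served from that Worker, and a miss causes the Worker to fetch the data once from under‑storage (UFS) and cache it for subsequent reads. All class and path names here are illustrative, and the in‑memory dicts stand in for real metadata and storage services.

```python
# Minimal sketch of the Alluxio read path (illustrative, not the real API).

class Worker:
    def __init__(self, ufs: dict):
        self.cache = {}    # path -> bytes cached on this worker
        self.ufs = ufs     # stands in for S3/HDFS under-storage

    def read(self, path: str) -> bytes:
        if path not in self.cache:             # cache miss:
            self.cache[path] = self.ufs[path]  # fetch once from UFS, then cache
        return self.cache[path]

class Client:
    def __init__(self, master: dict, workers: list):
        self.master = master    # Master metadata: path -> worker index
        self.workers = workers

    def read(self, path: str) -> bytes:
        worker = self.workers[self.master[path]]
        return worker.read(path)

ufs = {"/idx/shard-0": b"index-bytes"}
w = Worker(ufs)
client = Client(master={"/idx/shard-0": 0}, workers=[w])

client.read("/idx/shard-0")  # first read pulls from UFS and caches
client.read("/idx/shard-0")  # second read is served from the Worker cache
```

This is the source of the "single‑pass transfer" property mentioned below: the expensive cross‑cloud fetch happens once per object, and every later reader in the same region hits the Worker cache.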

Key features of Alluxio include format transparency, protocol compatibility (S3, POSIX, HDFS, etc.), a unified multi‑cloud view, single‑pass data transfer with caching, and high‑performance data access for AI/ML workloads.

Case studies demonstrate substantial performance gains: AI training tasks saw a 41% reduction in training time and higher CPU utilization; recommendation index distribution achieved over 10× faster data transfer and 80% cost savings by replacing cloud disks with object storage; large‑model downloads benefited similarly.

Additional innovations include intelligent cache management that pre‑loads hot data, pinning recent training samples, load progress monitoring with automatic fail‑over, and bandwidth throttling to protect dedicated lines.
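The bandwidth‑throttling idea can be illustrated with a generic token‑bucket rate limiter: transfers consume tokens that refill at the dedicated line's budgeted rate, so bursts are absorbed up to a cap and sustained traffic never saturates the line. This is a sketch of the general technique, not Alluxio's actual throttling implementation, and the rate figures are made up for illustration.

```python
# Illustrative token-bucket throttle for cross-cloud transfers.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s   # sustained budget for the line
        self.capacity = burst_bytes    # maximum burst size
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def try_send(self, nbytes: int) -> bool:
        """Return True if nbytes may be sent now, consuming tokens;
        False means the caller should back off and retry later."""
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, up to the burst cap.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

bucket = TokenBucket(rate_bytes_per_s=100e6, burst_bytes=10e6)  # ~100 MB/s line
bucket.try_send(8_000_000)   # within the burst budget: allowed
bucket.try_send(8_000_000)   # burst exhausted: rejected until tokens refill
```

Placed in front of the cross‑cloud fetch path, a limiter like this protects the dedicated line from being monopolized by a single bulk load such as an index distribution.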

Future plans involve creating a unified multi‑cloud storage product, improving GPU utilization across regions, accelerating big‑data queries at low cost, and increasing CPU utilization of under‑used Alluxio nodes.

Tags: Big Data · Multi-Cloud · Caching · Alluxio · AI Training · Data Acceleration
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
