
Multi-Cloud Unified Data Acceleration Layer at Xiaohongshu: Challenges, Alluxio Solution, and Performance Gains

This article presents Xiaohongshu's multi‑cloud unified data acceleration layer built with Alluxio, detailing the challenges of multi‑cloud architectures, the design goals, Alluxio's architecture and features, real‑world case studies in AI training and recommendation indexing, performance improvements, and future plans.


Xiaohongshu, a popular lifestyle content‑sharing and social commerce platform, faces significant challenges in its multi‑cloud architecture: complex cross‑zone communication, high latency, scarce and expensive dedicated lines, and inefficient utilization of its massive CPU/GPU fleet for AI training and recommendation services.

The main problems identified were low utilization of large‑scale CPU/GPU resources in machine‑learning training, slow index distribution in recommendation services, metadata handling for massive numbers of small files, and the high cost and instability caused by extensive cross‑cloud data transfers.

To address these issues, Xiaohongshu built a multi‑cloud unified data acceleration layer using Alluxio, which provides a transparent caching layer that can reuse existing data without migration, supports both S3 and POSIX protocols, controls cross‑cloud bandwidth, and scales to billions of AI training metadata entries.
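The dual‑protocol access described above means the same logical dataset can be reached either through a POSIX path (via a FUSE mount) or through an S3‑style URL (via Alluxio's S3 proxy). The sketch below illustrates the path mapping only; the mount point and proxy endpoint names are illustrative assumptions, not Xiaohongshu's actual configuration.

```python
# Sketch: one logical Alluxio path exposed through two protocols.
# FUSE_MOUNT and S3_PROXY are hypothetical names for illustration.

FUSE_MOUNT = "/mnt/alluxio-fuse"           # assumed POSIX (FUSE) mount point
S3_PROXY = "http://alluxio-proxy:39999"    # assumed S3-compatible endpoint

def posix_path(alluxio_path: str) -> str:
    """POSIX view of an Alluxio path, as seen under the FUSE mount."""
    return FUSE_MOUNT + alluxio_path

def s3_url(alluxio_path: str) -> str:
    """S3-proxy view of the same path: first component acts as the
    bucket, the remainder as the object key."""
    bucket, _, key = alluxio_path.lstrip("/").partition("/")
    return f"{S3_PROXY}/{bucket}/{key}"

print(posix_path("/training/samples/part-0001"))
print(s3_url("/training/samples/part-0001"))
```

A training job can therefore `open()` files through the POSIX view while an existing S3‑based pipeline reads the identical data through the proxy, with no migration of the underlying storage.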

Alluxio’s architecture consists of a Master for metadata, Workers for data caching and reads, Job Master and Job Workers for asynchronous tasks, and an Alluxio Client that fetches data from underlying storage when not cached.
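The read path above can be sketched as a small simulation: the client consults the Master's metadata to locate a Worker, a cache hit is served from that Worker, and a miss causes the Worker to fetch the data once from under‑storage (UFS) and cache it for subsequent reads. All class and path names here are illustrative, and the in‑memory dicts stand in for real metadata and storage services.

```python
# Minimal sketch of the Alluxio read path (illustrative, not the real API).

class Worker:
    def __init__(self, ufs: dict):
        self.cache = {}    # path -> bytes cached on this worker
        self.ufs = ufs     # stands in for S3/HDFS under-storage

    def read(self, path: str) -> bytes:
        if path not in self.cache:             # cache miss:
            self.cache[path] = self.ufs[path]  # fetch once from UFS, then cache
        return self.cache[path]

class Client:
    def __init__(self, master: dict, workers: list):
        self.master = master    # Master metadata: path -> worker index
        self.workers = workers

    def read(self, path: str) -> bytes:
        worker = self.workers[self.master[path]]
        return worker.read(path)

ufs = {"/idx/shard-0": b"index-bytes"}
w = Worker(ufs)
client = Client(master={"/idx/shard-0": 0}, workers=[w])

client.read("/idx/shard-0")  # first read pulls from UFS and caches
client.read("/idx/shard-0")  # second read is served from the Worker cache
```

This is the source of the "single‑pass transfer" property mentioned below: the expensive cross‑cloud fetch happens once per object, and every later reader in the same region hits the Worker cache.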

Key features of Alluxio include format transparency, protocol compatibility (S3, POSIX, HDFS, etc.), a unified multi‑cloud view, single‑pass data transfer with caching, and high‑performance data access for AI/ML workloads.

Case studies demonstrate substantial performance gains: AI training tasks saw a 41% reduction in training time and higher CPU utilization; recommendation index distribution achieved over 10× faster data transfer and 80% cost savings by replacing cloud disks with object storage; large‑model downloads benefited similarly.

Additional innovations include intelligent cache management that pre‑loads hot data, pinning recent training samples, load progress monitoring with automatic fail‑over, and bandwidth throttling to protect dedicated lines.
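The bandwidth‑throttling idea can be illustrated with a generic token‑bucket rate limiter: transfers consume tokens that refill at the dedicated line's budgeted rate, so bursts are absorbed up to a cap and sustained traffic never saturates the line. This is a sketch of the general technique, not Alluxio's actual throttling implementation, and the rate figures are made up for illustration.

```python
# Illustrative token-bucket throttle for cross-cloud transfers.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s   # sustained budget for the line
        self.capacity = burst_bytes    # maximum burst size
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def try_send(self, nbytes: int) -> bool:
        """Return True if nbytes may be sent now, consuming tokens;
        False means the caller should back off and retry later."""
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, up to the burst cap.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

bucket = TokenBucket(rate_bytes_per_s=100e6, burst_bytes=10e6)  # ~100 MB/s line
bucket.try_send(8_000_000)   # within the burst budget: allowed
bucket.try_send(8_000_000)   # burst exhausted: rejected until tokens refill
```

Placed in front of the cross‑cloud fetch path, a limiter like this protects the dedicated line from being monopolized by a single bulk load such as an index distribution.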

Future plans involve creating a unified multi‑cloud storage product, improving GPU utilization across regions, accelerating big‑data queries at low cost, and increasing CPU utilization of under‑used Alluxio nodes.

Tags: Big Data · Multi-Cloud · Caching · Alluxio · AI Training · Data Acceleration
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
