Hybrid Cloud Architecture and AI Storage Evolution at Zhihu: From UnionStore to Alluxio
This article describes Zhihu's hybrid cloud architecture—including offline, online, and GPU data centers—its self‑built UnionStore cache, the performance and latency challenges faced during large‑scale AI model training, and the subsequent evaluation and migration to Alluxio community and enterprise editions to achieve higher throughput, stability, and lower operational overhead.
Zhihu operates a hybrid cloud architecture that separates workloads across three data-center types: an offline data center for batch processing, an online data center serving core user-facing services with low latency, and a GPU data center providing machine-learning platforms and high-performance GPU resources.
The hybrid setup brings cost and disaster‑recovery benefits but also introduces storage challenges, especially for AI training tasks that require low‑latency access to large datasets across data‑center links.
To address cross-cloud data access, Zhihu built a custom cache system called UnionStore, which unifies object storage and HDFS behind a single interface. On each read, UnionStore first checks object storage for the file and serves it directly if present; otherwise it copies the file from HDFS into object storage before serving it, effectively providing a cross-data-center caching layer.
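The read path above can be sketched in a few lines. This is a minimal, hypothetical reconstruction from the description in the text; the class and method names (`InMemoryStore`, `UnionStore.read`) are illustrative stand-ins, not Zhihu's actual API:

```python
# Minimal sketch of the UnionStore read path described above.
# All names here are illustrative, not Zhihu's real implementation.

class InMemoryStore:
    """Dict-backed stand-in for an object store or an HDFS namespace."""
    def __init__(self, files=None):
        self.files = dict(files or {})

    def exists(self, path):
        return path in self.files

    def get(self, path):
        return self.files[path]

    def put(self, path, data):
        self.files[path] = data


class UnionStore:
    def __init__(self, object_store, hdfs):
        self.object_store = object_store  # cross-data-center cache tier
        self.hdfs = hdfs                  # source of truth in the offline room

    def read(self, path):
        # Cache hit: serve directly from object storage.
        if self.object_store.exists(path):
            return self.object_store.get(path)
        # Cache miss: copy the whole file from HDFS into object storage
        # first, then serve it. The reader blocks until the copy finishes,
        # which is the "cannot stream while caching" limitation noted below.
        data = self.hdfs.get(path)
        self.object_store.put(path, data)
        return data
```

The blocking copy in the miss path is what hurts large-checkpoint reads: first-access latency is the full HDFS-to-object-store transfer time.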
When Zhihu began training large language models in 2023, UnionStore showed several limitations: high metadata latency, insufficient read throughput for massive checkpoints, bandwidth bottlenecks under high concurrency, and inability to stream data while caching.
Faced with these issues, the team evaluated open‑source storage solutions and selected Alluxio because it met three key requirements: protocol compatibility (S3 and POSIX), superior performance, and transparent caching that maps directly to existing HDFS paths.
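Transparent caching means Alluxio mirrors the existing HDFS namespace, so training jobs can switch to the cache with a path-prefix swap rather than a data migration. A minimal sketch of that mapping, where the FUSE mount point is an assumed local path, not a value from the article:

```python
from urllib.parse import urlparse

# Assumed local mount point of the Alluxio FUSE daemon (illustrative).
ALLUXIO_FUSE_MOUNT = "/mnt/alluxio-fuse"

def to_alluxio_path(hdfs_uri: str) -> str:
    """Map an hdfs:// URI onto the same path under the Alluxio FUSE mount.

    Because Alluxio transparently caches the HDFS namespace, the path
    component is preserved verbatim; only the access prefix changes.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri}")
    return ALLUXIO_FUSE_MOUNT + parsed.path
```

A training job that previously read `hdfs://ns1/user/train/ckpt-0001` would open `/mnt/alluxio-fuse/user/train/ckpt-0001` through ordinary POSIX file I/O, with Alluxio handling the cross-data-center fetch and caching underneath.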
Alluxio’s community edition offered transparent caching, customizable metadata and data caches, broad UFS support, ad-hoc query acceleration, and rich S3/FUSE interfaces, delivering read speeds of 500 MB/s over FUSE and more than 1 GB/s through the S3 proxy. The migration from UnionStore to Alluxio was completed in three months, yielding 2-3× faster model distribution and a 60% reduction in training-time latency.
However, as training workloads grew, the community edition revealed stability and scalability problems: frequent OOM-triggered FUSE restarts, a single-master metadata bottleneck, limited write performance for checkpointing, and high operational overhead from manual image building and Kubernetes manifest management.
The Alluxio enterprise edition addressed these challenges by redesigning the cluster architecture: distributing metadata across workers via consistent hashing, introducing etcd-based service discovery, improving FUSE stability with Netty-based data transfer, and raising write throughput to as much as 1.4 GB/s. It also shipped a Kubernetes operator for one-click deployment, dramatically reducing operational complexity.
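Consistent hashing is what lets the enterprise edition shard metadata across workers without a single master: each path hashes to a position on a ring, and adding or removing a worker only remaps the keys that landed on that worker. The toy ring below illustrates the technique; the vnode count and hash choice are illustrative, not Alluxio's implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring for spreading file-path metadata across
    workers. Parameters are illustrative, not Alluxio's implementation."""

    def __init__(self, workers, vnodes=64):
        # Each worker owns many virtual nodes so load spreads evenly.
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w)
            for w in workers
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, path):
        # Walk clockwise to the first virtual node at or after the key's
        # hash, wrapping around the end of the ring.
        h = self._hash(path)
        idx = bisect.bisect(self.ring, (h,))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]
```

The key property: if a worker leaves the ring, only the paths it owned move to other workers; all other path-to-worker assignments stay put, which keeps metadata cache churn low during scaling events.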
Overall, the migration to Alluxio (both community and enterprise) enabled Zhihu to sustain large‑scale AI training with improved performance, stability, and operational efficiency, while continuing to explore further optimizations for AI storage in a hybrid cloud environment.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.