Hybrid Cloud Architecture and AI Storage Evolution at Zhihu: From UnionStore to Alluxio
This article describes Zhihu's hybrid cloud architecture—including offline, online, and GPU data centers—its self‑built UnionStore cache, the performance and latency challenges faced during large‑scale AI model training, and the subsequent evaluation and migration to Alluxio community and enterprise editions to achieve higher throughput, stability, and lower operational overhead.
Zhihu operates a hybrid cloud architecture that separates workloads across three data-center types: an offline data center for batch processing, an online data center serving core user-facing services with low latency, and a GPU data center providing machine-learning platforms and high-performance GPU resources.
The hybrid setup brings cost and disaster‑recovery benefits but also introduces storage challenges, especially for AI training tasks that require low‑latency access to large datasets across data‑center links.
To address cross-cloud data access, Zhihu built a custom cache system called UnionStore, which unifies object storage and HDFS behind a single interface. On each read, UnionStore first checks object storage for the file and serves it directly if present; otherwise it copies the file from HDFS into object storage before serving it, effectively providing a cross-data-center caching layer.
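The read path above can be sketched in a few lines. This is a minimal, hypothetical reconstruction from the description in the text; the class and method names (`InMemoryStore`, `UnionStore.read`) are illustrative stand-ins, not Zhihu's actual API:

```python
# Minimal sketch of the UnionStore read path described above.
# All names here are illustrative, not Zhihu's real implementation.

class InMemoryStore:
    """Dict-backed stand-in for an object store or an HDFS namespace."""
    def __init__(self, files=None):
        self.files = dict(files or {})

    def exists(self, path):
        return path in self.files

    def get(self, path):
        return self.files[path]

    def put(self, path, data):
        self.files[path] = data


class UnionStore:
    def __init__(self, object_store, hdfs):
        self.object_store = object_store  # cross-data-center cache tier
        self.hdfs = hdfs                  # source of truth in the offline room

    def read(self, path):
        # Cache hit: serve directly from object storage.
        if self.object_store.exists(path):
            return self.object_store.get(path)
        # Cache miss: copy the whole file from HDFS into object storage
        # first, then serve it. The reader blocks until the copy finishes,
        # which is the "cannot stream while caching" limitation noted below.
        data = self.hdfs.get(path)
        self.object_store.put(path, data)
        return data
```

The blocking copy in the miss path is what hurts large-checkpoint reads: first-access latency is the full HDFS-to-object-store transfer time.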
When Zhihu began training large language models in 2023, UnionStore showed several limitations: high metadata latency, insufficient read throughput for massive checkpoints, bandwidth bottlenecks under high concurrency, and inability to stream data while caching.
Faced with these issues, the team evaluated open‑source storage solutions and selected Alluxio because it met three key requirements: protocol compatibility (S3 and POSIX), superior performance, and transparent caching that maps directly to existing HDFS paths.
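Transparent caching means Alluxio mirrors the existing HDFS namespace, so training jobs can switch to the cache with a path-prefix swap rather than a data migration. A minimal sketch of that mapping, where the FUSE mount point is an assumed local path, not a value from the article:

```python
from urllib.parse import urlparse

# Assumed local mount point of the Alluxio FUSE daemon (illustrative).
ALLUXIO_FUSE_MOUNT = "/mnt/alluxio-fuse"

def to_alluxio_path(hdfs_uri: str) -> str:
    """Map an hdfs:// URI onto the same path under the Alluxio FUSE mount.

    Because Alluxio transparently caches the HDFS namespace, the path
    component is preserved verbatim; only the access prefix changes.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri}")
    return ALLUXIO_FUSE_MOUNT + parsed.path
```

A training job that previously read `hdfs://ns1/user/train/ckpt-0001` would open `/mnt/alluxio-fuse/user/train/ckpt-0001` through ordinary POSIX file I/O, with Alluxio handling the cross-data-center fetch and caching underneath.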
Alluxio’s community edition offered transparent caching, customizable metadata and data caches, broad UFS support, ad-hoc query acceleration, and rich S3/FUSE interfaces, delivering read speeds of 500 MB/s over FUSE and more than 1 GB/s through the S3 proxy. The migration from UnionStore to Alluxio was completed in three months, yielding 2-3× faster model distribution and a 60% reduction in training-time latency.
However, as training workloads grew, the community edition revealed stability and scalability problems: frequent OOM-triggered FUSE restarts, a single-master metadata bottleneck, limited write performance for checkpointing, and high operational overhead from manual image building and Kubernetes manifest management.
The Alluxio enterprise edition addressed these challenges by redesigning the cluster architecture: distributing metadata across workers via consistent hashing, introducing etcd-based service discovery, improving FUSE stability with Netty-based data transfer, and raising write throughput to as much as 1.4 GB/s. It also shipped a Kubernetes operator for one-click deployment, dramatically reducing operational complexity.
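Consistent hashing is what lets the enterprise edition shard metadata across workers without a single master: each path hashes to a position on a ring, and adding or removing a worker only remaps the keys that landed on that worker. The toy ring below illustrates the technique; the vnode count and hash choice are illustrative, not Alluxio's implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring for spreading file-path metadata across
    workers. Parameters are illustrative, not Alluxio's implementation."""

    def __init__(self, workers, vnodes=64):
        # Each worker owns many virtual nodes so load spreads evenly.
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w)
            for w in workers
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, path):
        # Walk clockwise to the first virtual node at or after the key's
        # hash, wrapping around the end of the ring.
        h = self._hash(path)
        idx = bisect.bisect(self.ring, (h,))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]
```

The key property: if a worker leaves the ring, only the paths it owned move to other workers; all other path-to-worker assignments stay put, which keeps metadata cache churn low during scaling events.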
Overall, the migration to Alluxio (both community and enterprise) enabled Zhihu to sustain large‑scale AI training with improved performance, stability, and operational efficiency, while continuing to explore further optimizations for AI storage in a hybrid cloud environment.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.