
Design and Architecture of Zeppelin Distributed Block Storage System

This article presents an in‑depth overview of Zeppelin, a high‑availability, high‑performance block storage service, covering its motivation, online vs offline storage distinctions, data distribution strategies, centralized meta‑server design, replication policies, RocksDB‑based storage engine, Raft‑based consistency protocol, threading model, client request flow, and fault‑handling mechanisms.

Architecture Digest

The article introduces Zeppelin, a storage project developed by Qihoo 360 alongside systems such as Bada and Pika, which already run hundreds of instances in production environments.

It explains why a new storage system is needed: different scenarios demand different solutions, with online storage requiring low latency and diverse interfaces, while offline storage focuses on high throughput for batch analytics and is largely standardized around HDFS.

Zeppelin is positioned as a highly available, high‑performance, customizable‑consistency block storage service that can serve as the underlying layer for various protocols, including S3‑compatible object storage, POSIX‑style file systems, and NewSQL interfaces.

Data distribution in Zeppelin balances uniformity and consistency; the system adopts a fixed‑hash partitioning scheme because it is simple to implement, scales by allowing the number of partitions to exceed the number of servers, and eases operational management.
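The idea behind fixed-hash partitioning can be sketched as follows. This is an illustrative model only: the hash function, partition count, and placement rule here are assumptions, not Zeppelin's actual implementation. The key property is that the partition count is fixed at table creation and larger than the server count, so scaling moves whole partitions between servers instead of rehashing keys.

```python
# Illustrative sketch of fixed-hash partitioning (not Zeppelin's real code).
import hashlib

NUM_PARTITIONS = 1024  # fixed at table creation; an assumed value

def partition_of(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition id that never changes for the table's lifetime."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def server_of(partition_id: int, servers: list) -> str:
    """Partitions, not keys, are assigned to servers; adding a server only
    remaps some partitions rather than rehashing every key."""
    return servers[partition_id % len(servers)]
```

Because a key's partition id is stable, only the partition-to-server map (held by the meta-server) ever changes during rebalancing, which is what makes the scheme operationally simple.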

Zeppelin uses a center‑node (meta‑server) architecture to store metadata. This design offers simplicity, clear updates, and strong consistency, though it introduces a potential single point of failure; mitigations include extensive client caching, heartbeat mechanisms, and optimizations to reduce meta‑server load.

Replication currently follows a Master‑Slave model, with future plans to support Quorum and Erasure Coding strategies to balance availability, performance, and storage efficiency.

The storage engine is a customized fork of RocksDB called nemo-rocksdb, which adds features such as TTL support, with expired data reclaimed during background compaction.
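One common way to add TTL on top of a log-structured engine like RocksDB is to encode an expiry timestamp next to each value, expire lazily on read, and drop dead entries in a compaction filter. The sketch below shows that pattern; the encoding and function names are assumptions for illustration, not nemo-rocksdb's actual format.

```python
# Illustrative TTL encoding (assumed format, not nemo-rocksdb's real layout).
import struct
import time

def encode_with_ttl(value: bytes, ttl_seconds: int) -> bytes:
    """Prepend an absolute expiry timestamp (big-endian uint32) to the value."""
    expire_at = int(time.time()) + ttl_seconds
    return struct.pack(">I", expire_at) + value

def decode(stored: bytes):
    """Return the value, or None if expired (lazy expiry on the read path)."""
    expire_at, = struct.unpack(">I", stored[:4])
    return None if time.time() >= expire_at else stored[4:]

def compaction_filter(stored: bytes) -> bool:
    """True means 'drop this entry': expired data is physically reclaimed in
    the background, the role a RocksDB compaction filter plays."""
    expire_at, = struct.unpack(">I", stored[:4])
    return time.time() >= expire_at
```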

Metadata consistency is guaranteed by Floyd, a C++ implementation of the Raft consensus algorithm. Floyd provides strong consistency for writes and an optional dirty‑read interface for higher read performance.

The thread model is divided into several categories: meta‑information threads (Heartbeat and MetaCmd), user command threads (Dispatch and Worker), synchronization threads (TrySync, Binlog Sender/Receiver, and associated background workers), and background maintenance threads (Binlog Purge, BGSave, and DBSync).

Client request flow: a client first queries the Meta Server for table metadata, computes the partition for the target key, and then directly contacts the master node of that partition to execute the operation.
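That three-step flow can be sketched as below. All class and method names here are stand-ins for illustration; the point is that the client caches the routing table it fetched from the Meta Server, so the Meta Server stays off the data path and subsequent requests go straight to partition masters.

```python
# Illustrative client request flow (stub names are assumptions).
def partition_of(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions   # stand-in for Zeppelin's real hash

class MetaServerStub:
    """Pretend meta server: returns the epoch and partition->master map."""
    def __init__(self, routing, epoch=1):
        self.routing, self.epoch = routing, epoch
    def get_route(self, table):
        return self.epoch, self.routing[table]

class NodeStub:
    """Pretend data node acting as a partition master."""
    def __init__(self):
        self.store = {}
    def put(self, key, value):
        self.store[key] = value
        return "OK"

class Client:
    def __init__(self, meta):
        self.meta = meta
        self.cache = {}   # table -> (epoch, partition -> master); cuts meta load

    def set(self, table, key, value):
        if table not in self.cache:                # step 1: fetch table metadata once
            self.cache[table] = self.meta.get_route(table)
        epoch, route = self.cache[table]
        pid = partition_of(key, len(route))        # step 2: compute the partition
        master = route[pid]                        # step 3: contact its master directly
        return master.put(key, value)
```

The client-side route cache is also the main mitigation the article mentions for meta-server load: a stale cache is detected via the epoch and refreshed, rather than consulting the Meta Server on every request.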

Failure detection works via periodic PING messages from each node to the Meta Server; when a node or network failure is detected, the Meta Server updates the epoch, prompting other nodes to pull the latest metadata and reconfigure master‑slave relationships accordingly.
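A minimal model of that epoch-based detection is sketched below, with explicit timestamps for clarity. The field names, timeout value, and PING bookkeeping are assumptions; only the shape matches the article: missed PINGs mark a node down, the epoch is bumped, and peers that see a newer epoch re-pull metadata.

```python
# Illustrative epoch-based failure detection (names and timeout are assumed).
import time

class MetaServer:
    def __init__(self, timeout: float = 3.0):
        self.epoch = 0
        self.last_ping = {}    # node -> timestamp of its last PING
        self.timeout = timeout

    def on_ping(self, node: str, now: float = None):
        """Each data node periodically PINGs the Meta Server."""
        self.last_ping[node] = now if now is not None else time.time()

    def check_liveness(self, now: float = None):
        """Mark nodes whose PING deadline passed as down and bump the epoch.
        A changed epoch signals every node to pull fresh metadata and
        reconfigure master-slave relationships."""
        now = now if now is not None else time.time()
        dead = [n for n, t in self.last_ping.items() if now - t > self.timeout]
        if dead:
            for n in dead:
                del self.last_ping[n]
            self.epoch += 1
        return dead
```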

Source: Zhihu article

Tags: Replication, Distributed Storage, RocksDB, Raft, block storage, Zeppelin, hash partitioning
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
