Building a High‑Availability ClickHouse Cluster with RaftKeeper
This article explains how RaftKeeper leverages the Raft consensus algorithm to create a high‑availability, high‑performance ClickHouse cluster across multiple data centers, covering project background, architecture, core features, performance optimizations, and real‑world deployment results.
ClickHouse is widely used for fast large‑scale data analytics, but its reliance on Zookeeper for metadata, logs, and coordination creates performance bottlenecks as data volume and request rates grow.
RaftKeeper adopts the Raft distributed consensus protocol to provide a high‑performance, highly available coordination service that can replace Zookeeper in ClickHouse clusters.
1. Project Background
In ClickHouse, Zookeeper stores metadata, logs operation history, and coordinates services such as service discovery, distributed queues, and locks. The massive data size and high request throughput expose Zookeeper’s performance limits, leading to read‑only tables, DDL timeouts, and part‑merge failures.
To overcome these issues, JD.com developed RaftKeeper, an open‑source project that aims for full Zookeeper API compatibility, significantly higher throughput, and reduced blocking caused by JVM GC or state‑machine map expansion.
Source code: https://github.com/JDRaftKeeper/RaftKeeper
2. Raft Distributed Consensus Algorithm
Raft solves the problem of keeping replicas consistent in a distributed system. Compared with Paxos, Raft is easier to understand and implement, and is used in systems such as Kudu, TiDB, and etcd.
Raft’s two key processes are leader election and log replication; writes go through the leader and are replicated to followers to guarantee consistency.
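Raft's commit rule can be sketched briefly. The following is an illustrative sketch, not RaftKeeper code: it shows only the majority rule by which a leader decides how far the log is committed, using a hypothetical `match_index` list of per-node replication progress.

```python
# Illustrative sketch of Raft's commit rule (not RaftKeeper code):
# a log entry is committed once a majority of nodes have persisted it.

def majority(cluster_size: int) -> int:
    """Smallest number of nodes forming a quorum."""
    return cluster_size // 2 + 1

def commit_index(match_index: list[int], cluster_size: int) -> int:
    """Highest log index replicated on a majority of nodes.

    match_index holds, for every node (leader included), the highest
    log index known to be persisted on that node.
    """
    return sorted(match_index, reverse=True)[majority(cluster_size) - 1]

# In a 5-node cluster where nodes have persisted up to indexes
# 10, 10, 9, 4, 3, a quorum of 3 covers everything up to index 9.
result = commit_index([10, 10, 9, 4, 3], 5)
print(result)  # -> 9
```

Writes therefore survive any minority of node failures: once committed, an entry is present on at least one node of every possible quorum.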
3. Why NuRaft?
RaftKeeper builds on the open‑source NuRaft library, which extends Raft with useful features (Pre‑Vote, leadership expiration, leader assignment, configurable quorum sizes) and provides a lightweight, well‑structured code base with clear examples. Its pipelined and batched request handling greatly improves performance.
4. Core Features of RaftKeeper
Full compatibility with the Zookeeper API and ecosystem.
Write throughput roughly twice that of Zookeeper (see benchmark).
More stable TP99 latency.
Preserves request order within a session.
Supports 300,000 concurrent long‑lived connections via a multiplexed network model.
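The multiplexed network model above can be sketched with readiness-based I/O: one thread services many sockets via an event loop instead of dedicating a thread per connection. This is a minimal illustration using Python's standard `selectors` module and a local socket pair; RaftKeeper's actual NIO implementation is in C++ and differs in detail.

```python
import selectors
import socket

# One selector (event loop) multiplexes many non-blocking sockets;
# this is what lets a handful of threads serve huge connection counts.
sel = selectors.DefaultSelector()
a, b = socket.socketpair()          # stand-in for a client connection
a.setblocking(False)
b.setblocking(False)
sel.register(a, selectors.EVENT_READ)

b.sendall(b"ping")                  # "client" writes a request

data = b""
for key, _mask in sel.select(timeout=1):
    # Only sockets that are actually readable are handled here.
    data = key.fileobj.recv(4096)

sel.unregister(a)
a.close()
b.close()
print(data)  # -> b'ping'
```

With this model, idle connections cost only a file descriptor and a little bookkeeping, which is what makes hundreds of thousands of long-lived sessions feasible.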
5. Architecture Design
The core modules are StateMachine, LogStore, and Snapshot. StateMachine implements data access, watch, and ephemeral‑node features. LogStore and Snapshot handle persistence using a Snapshot + Binlog approach with segmented files, versioned CRC checks, and optional compression.
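The versioned CRC check on persisted records can be illustrated with a toy encoding. The `[crc32][length][payload]` layout below is a hypothetical simplification for illustration; RaftKeeper's real on-disk binlog format differs.

```python
import struct
import zlib

# Hypothetical CRC-guarded log record: [crc32][length][payload].
# On read, the checksum detects torn or corrupted writes.

def encode_record(payload: bytes) -> bytes:
    return struct.pack("<II", zlib.crc32(payload), len(payload)) + payload

def decode_record(buf: bytes) -> bytes:
    crc, length = struct.unpack_from("<II", buf)
    payload = buf[8:8 + length]
    if zlib.crc32(payload) != crc:
        raise ValueError("binlog record corrupted")
    return payload

rec = encode_record(b"create /clickhouse/node1")
assert decode_record(rec) == b"create /clickhouse/node1"
```

On restart, the service replays the snapshot and then only the binlog records whose checksums verify, truncating at the first corrupted record.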
Consistency guarantees include ordered responses for requests within the same session and global commit order across all sessions.
6. Session‑Level Request Ordering
Writes flow through a pipeline: IO receives the request, Forwarder forwards it to the leader, the leader batches it, Raft replicates the log, and Processor executes the request. To keep ordering while scaling, each pipeline stage runs on a set of PipeRunners (thread + connection + queue). Requests from the same session are hashed to the same PipeRunner, ensuring they stay on the same logical pipe.
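The session-to-PipeRunner routing above can be sketched in a few lines. This is illustrative only; the class and function names are hypothetical, and RaftKeeper's real PipeRunner bundles a thread, connection, and queue in C++.

```python
from queue import Queue

# Hypothetical sketch: requests from one session always hash to the
# same PipeRunner, so per-session ordering is preserved while
# different sessions still spread across runners for parallelism.

class PipeRunner:
    def __init__(self) -> None:
        self.queue: Queue = Queue()   # stands in for thread + connection + queue

def runner_for(session_id: int, runners: list[PipeRunner]) -> PipeRunner:
    return runners[hash(session_id) % len(runners)]

runners = [PipeRunner() for _ in range(4)]

# Two requests from session 42 land on the same runner...
assert runner_for(42, runners) is runner_for(42, runners)
# ...so enqueueing them preserves their arrival order.
runner_for(42, runners).queue.put("setData /t1")
runner_for(42, runners).queue.put("create /t1/part")
```

Because ordering is only guaranteed within a runner's queue, pinning a session to one runner is what makes the pipeline safe to scale out horizontally.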
Read requests bypass the pipeline and are processed directly by the Processor, which still respects a globally increasing XID to preserve order in mixed read/write workloads.
7. Performance Optimizations
Pipeline and batch processing are the primary optimizations. A courier‑delivery analogy illustrates both: delivering ten packages in one trip instead of ten separate trips cuts total time roughly tenfold (batching), and adding a relay courier so that pickup and hand‑off overlap halves the time again (pipelining).
In RaftKeeper, an Accumulator component batches write requests early in the log‑replication phase, and a streaming processing mechanism splits large requests into smaller chunks, dramatically increasing throughput.
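The Accumulator idea can be sketched as a simple batching buffer. The class below is an illustrative sketch under assumed thresholds, not RaftKeeper's implementation: it collects requests until a count limit, then emits the batch as one unit (standing in for a single Raft log append).

```python
# Hypothetical Accumulator sketch: amortize per-append replication
# cost by grouping many write requests into one Raft log entry.

class Accumulator:
    def __init__(self, max_batch: int = 100) -> None:
        self.max_batch = max_batch
        self.pending: list[str] = []
        self.flushed: list[list[str]] = []   # stands in for Raft appends

    def add(self, request: str) -> None:
        self.pending.append(request)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        """Hand the current batch to replication as one log append."""
        if self.pending:
            self.flushed.append(self.pending)
            self.pending = []

acc = Accumulator(max_batch=3)
for i in range(7):
    acc.add(f"req-{i}")
acc.flush()                      # drain the trailing partial batch
print([len(b) for b in acc.flushed])  # -> [3, 3, 1]
```

A real accumulator would also flush on a timer or byte-size threshold so that a trickle of requests is not delayed indefinitely.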
Latency improvements include a two‑level hash table in the state machine to limit expansion pauses, and an asynchronous snapshot mechanism that writes full snapshots to disk without blocking client requests.
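The two-level hash table can be sketched as follows. This is an illustrative sketch, not RaftKeeper's data structure: keys are spread over a fixed set of independent buckets so that any single rehash touches only one bucket's share of the data rather than pausing on the whole map at once.

```python
# Hypothetical two-level hash table sketch: the first level is a fixed
# array of buckets, the second level an ordinary hash map per bucket.
# Growing one bucket rehashes ~1/N of the keys instead of all of them.

class TwoLevelMap:
    def __init__(self, n_buckets: int = 16) -> None:
        self.buckets = [dict() for _ in range(n_buckets)]

    def _bucket(self, key: str) -> dict:
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key: str, value: bytes) -> None:
        self._bucket(key)[key] = value

    def __getitem__(self, key: str) -> bytes:
        return self._bucket(key)[key]

    def __contains__(self, key: str) -> bool:
        return key in self._bucket(key)

m = TwoLevelMap()
m["/clickhouse/tables/t1"] = b"meta"
assert m["/clickhouse/tables/t1"] == b"meta"
```

The same partitioning also helps the asynchronous snapshot: buckets can be serialized one at a time while other buckets keep serving requests.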
8. Production Results
Since its launch in November 2021, RaftKeeper runs in more than 60 JD.com clusters and has withstood three major sales events. Throughput doubled compared with Zookeeper, and TP99 latency became more stable.
Failure rates of high‑frequency imports dropped sharply, and cluster size grew from ~60 nodes to ~200 nodes.
9. Cross‑DataCenter Architecture
RaftKeeper enables ClickHouse high‑availability across zones. Each zone hosts two ClickHouse replicas; RaftKeeper nodes are spread across three zones (two nodes in zone 1 and zone 2, one node in zone 3). The design tolerates the loss of an entire zone while keeping the service online.
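A quick check shows why the 2‑2‑1 placement tolerates a full zone outage: with five RaftKeeper nodes, the Raft quorum is three, and losing any one zone leaves at least three survivors. The snippet below just verifies that arithmetic.

```python
# Verify the 2-2-1 zone layout keeps a Raft quorum (3 of 5)
# after losing any single zone.
zones = {"zone1": 2, "zone2": 2, "zone3": 1}
total = sum(zones.values())      # 5 nodes
quorum = total // 2 + 1          # 3 nodes

for lost, n in zones.items():
    survivors = total - n
    assert survivors >= quorum, f"losing {lost} breaks quorum"
print("any single-zone failure leaves a quorum")
```

Note that a layout like 3‑2 across only two zones would not have this property: losing the three-node zone leaves two survivors, below quorum.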
Unlike Zookeeper, RaftKeeper allows manual leader placement in a specific zone, reducing cross‑zone latency for data ingestion.
10. Q&A
Q1: How will RaftKeeper maintain compatibility with ClickHouse Keeper in the future?
A: RaftKeeper aims to support all Zookeeper use‑cases; ClickHouse Keeper focuses on ClickHouse‑specific scenarios. Functionality is largely consistent, but the two projects evolve independently.
Q2: What advantages does RaftKeeper have over ClickHouse’s native Keeper?
A: Higher throughput, guaranteed session‑order semantics, NIO network model supporting ~300k long connections, and architectural differences such as session management and pipeline/batch processing.
Q3: What is the current production scale and future plan for RaftKeeper?
A: Deployed in over 60 JD.com environments; future work includes expanding to more use‑cases and further community collaboration.
Thank you for reading.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.