Enhancing ClickHouse High Availability: Reducing Zookeeper Load, Faster Recovery, and Additional Reliability Improvements
ByteDance’s article details the high‑availability challenges of ClickHouse in large‑scale deployments—such as frequent failures, long recovery times, and operational complexity—and explains three key enhancements: a new HaMergeTree engine to lessen Zookeeper load, RocksDB‑based metadata persistence for faster restarts, and additional reliability features like HaKafka and monitoring tools.
Introduction: ClickHouse is renowned for its powerful data‑analysis performance, but ByteDance discovered several limitations when using it at massive scale, including missing full upsert/delete support, weak multi‑table join capabilities, reduced availability as cluster size grows, and lack of resource isolation.
To address these issues, ByteDance decided to comprehensively strengthen ClickHouse, focusing in this article on improving high‑availability.
01. Availability problems encountered by ByteDance
Rapid business growth drove a steep increase in ClickHouse nodes and daily partitions, causing metadata inconsistencies and a sharp drop in cluster availability. The three main symptoms were:
More frequent failures (hardware faults, Zookeeper bottlenecks, etc.)
Long fault‑recovery time (often >1 hour due to many partitions)
Increased operational complexity (more nodes and partitions demand far more operations and maintenance effort to keep the cluster stable)
These issues became a critical barrier to business development, prompting a systematic breakdown and resolution plan.
02. Solutions to improve high‑availability
1. Reduce Zookeeper pressure
Native ClickHouse uses the ReplicatedMergeTree engine, which relies heavily on Zookeeper for replica election, data sync, and fault recovery. Zookeeper stores not only table‑level metadata but also logical logs and part information, making it a performance bottleneck at scale.
To alleviate this, ByteDance introduced a new HaMergeTree engine that minimizes Zookeeper interactions:
Retains only table‑level metadata in Zookeeper
Simplifies logical log allocation
Removes part information from Zookeeper entirely
HaMergeTree reduces the amount of operation logs stored in Zookeeper, keeping only the log LSN while actual logs are synchronized between replicas via a Gossip protocol. The engine remains fully compatible with ReplicatedMergeTree, dramatically lowering Zookeeper load and eliminating Zookeeper‑related anomalies even in clusters with thousands of nodes or tens of thousands of tables.
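The division of labor described above can be illustrated with a minimal sketch. All names here (`Coordinator`, `Replica`, `gossip_from`) are hypothetical and greatly simplified, not ByteDance's actual implementation: the coordinator (standing in for Zookeeper) holds only the latest log sequence number, while the full log entries live on replicas and spread between them peer to peer.

```python
# Hedged sketch, assuming a simplified HaMergeTree-style split:
# the coordinator stores only the LSN; log bodies replicate via gossip.

class Coordinator:
    """Stand-in for Zookeeper: keeps nothing but the table's latest LSN."""
    def __init__(self):
        self.lsn = 0

    def allocate_lsn(self):
        self.lsn += 1
        return self.lsn


class Replica:
    def __init__(self, name):
        self.name = name
        self.log = {}          # lsn -> entry; kept locally, never in the coordinator
        self.applied_lsn = 0

    def append(self, coordinator, entry):
        # Writing replica asks the coordinator for a new LSN, keeps the body locally.
        lsn = coordinator.allocate_lsn()
        self.log[lsn] = entry
        self.applied_lsn = lsn
        return lsn

    def gossip_from(self, peer, coordinator):
        # Pull any entries between our applied LSN and the coordinator's latest.
        for lsn in range(self.applied_lsn + 1, coordinator.lsn + 1):
            if lsn in peer.log:
                self.log[lsn] = peer.log[lsn]
                self.applied_lsn = lsn


zk = Coordinator()
r1, r2 = Replica("r1"), Replica("r2")
r1.append(zk, "insert part_1")
r1.append(zk, "merge part_1 part_2")
r2.gossip_from(r1, zk)
print(r2.applied_lsn)  # 2
```

The point of the sketch is the data placement, not the protocol details: per-entry traffic moves off the coordination service, so Zookeeper's load no longer grows with the number of operations.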
2. Improve fault‑recovery capability
ClickHouse keeps most metadata in memory, causing server restarts to take one to two hours, which prolongs recovery after a failure. ByteDance solved this by persisting metadata to RocksDB. On startup, the server loads metadata directly from RocksDB, keeping only essential part information in memory.
The architecture stores each table’s part metadata in a dedicated RocksDB instance; on first launch, part metadata is persisted to RocksDB, and subsequent restarts load it from there. This reduces memory usage and speeds up both startup and recovery.
After metadata persistence, the system can support over one million parts per node without memory constraints, and recovery time drops from hours to a few minutes (e.g., from 1‑2 hours to about 3 minutes), greatly enhancing high‑availability.
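The write-through persistence described above can be sketched as follows. This is a minimal illustration with hypothetical names (`KVStore`, `Table`, `add_part`), where a plain in-memory map stands in for the per-table RocksDB instance; the actual engine is far more involved.

```python
# Hedged sketch: part metadata is persisted when a part is created, and a
# restart rebuilds the in-memory view from the store instead of rescanning disk.
import json


class KVStore:
    """Stand-in for a per-table RocksDB instance."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def items(self):
        return self._data.items()


class Table:
    def __init__(self, store):
        self.store = store
        self.parts = {}  # minimal in-memory view: part name -> metadata

    def add_part(self, name, meta):
        # Write-through: persist metadata the moment the part appears.
        self.store.put(name, json.dumps(meta))
        self.parts[name] = meta

    @classmethod
    def load(cls, store):
        # Restart path: read metadata straight back from the store, so
        # startup cost no longer scales with a full filesystem scan.
        table = cls(store)
        table.parts = {k: json.loads(v) for k, v in store.items()}
        return table


store = KVStore()
t = Table(store)
t.add_part("202406_1_1_0", {"rows": 1000, "bytes": 8192})

restarted = Table.load(store)
print(restarted.parts["202406_1_1_0"]["rows"])  # 1000
```

With a real RocksDB instance backing `KVStore`, the loaded view survives process restarts, which is what turns an hours-long recovery into minutes.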
3. Other enhancements
Beyond the two major improvements, ByteDance added several features to further boost reliability, such as the HaKafka engine for highly available real‑time data ingestion, a comprehensive monitoring and alerting platform, and various diagnostic tools for rapid fault localization.
Stability is paramount for data‑analysis platforms, and ByteDance’s ongoing work aims to provide the strongest possible reliability guarantees behind extreme performance.
ByteHouse, the enhanced ClickHouse offering, is now available through Volcano Engine with multiple versions to suit different users. Free trials can be requested on the ByteHouse website.
Thank you for reading.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.