How ByteHouse Enhances ClickHouse with Resource Isolation and High Availability
This article explains how ByteHouse, an enhanced version of ClickHouse used at ByteDance, adds full upsert support, multi‑table joins, high‑availability features, and, most importantly, a Resource Group mechanism that provides fine‑grained CPU, memory, and concurrency isolation to improve query performance and stability.
ByteHouse is ByteDance's internally enhanced version of ClickHouse. It addresses several limitations of the open‑source engine:
Missing complete upsert and delete operations.
Weak multi‑table join capabilities.
Degraded availability as the cluster grows large.
Lack of resource isolation.
To overcome these issues, ByteDance launched a five‑part enhancement plan covering upsert support, multi‑table joins, query optimization, high availability, and resource isolation. This article focuses on the resource isolation component.
Resource Group Solution
ByteHouse introduces a Resource Group component that partitions CPU, memory, and concurrency resources into distinct groups with parent‑child relationships, allowing shared resources while enforcing isolation.
Concurrency Control
The max_concurrent_queries setting limits the number of simultaneous queries per group. When the limit is reached, additional queries are queued until a running query finishes, at which point the earliest queued query is executed. If a child group's own queue is empty, the parent group releases the freed capacity so that queued queries in sibling groups can run.
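The queueing behavior described above can be sketched as a small in-memory model. This is a minimal illustration, not ByteHouse's implementation: the ResourceGroup class, its submit and finish methods, and the rule that a query needs a free slot in its own group and in every ancestor are assumptions made for the example. Only the max_concurrent_queries name comes from the text.

```python
from collections import deque
from typing import Optional


class ResourceGroup:
    """Hypothetical model of a group with a concurrency cap and a FIFO queue."""

    def __init__(self, name: str, max_concurrent_queries: int,
                 parent: Optional["ResourceGroup"] = None) -> None:
        self.name = name
        self.max_concurrent_queries = max_concurrent_queries
        self.parent = parent
        self.children = []
        self.running = 0          # queries running in this group or its descendants
        self.queue = deque()      # FIFO of waiting query ids
        if parent is not None:
            parent.children.append(self)

    def _has_capacity(self) -> bool:
        # Assumption: a slot must be free in this group and in every ancestor.
        group = self
        while group is not None:
            if group.running >= group.max_concurrent_queries:
                return False
            group = group.parent
        return True

    def _occupy(self, delta: int) -> None:
        # Count the query against this group and all of its ancestors.
        group = self
        while group is not None:
            group.running += delta
            group = group.parent

    def _start_next(self) -> Optional[str]:
        # Start the earliest queued query if a slot is available.
        if self.queue and self._has_capacity():
            self._occupy(+1)
            return self.queue.popleft()
        return None

    def submit(self, query_id: str) -> bool:
        """Start the query if possible, otherwise queue it. Returns True if started."""
        if self._has_capacity():
            self._occupy(+1)
            return True
        self.queue.append(query_id)
        return False

    def finish(self) -> Optional[str]:
        """Release one slot and return the id of the query started next, if any."""
        self._occupy(-1)
        started = self._start_next()
        if started is None and self.parent is not None:
            # Own queue is empty: let a sibling's queued query use the freed slot.
            for sibling in self.parent.children:
                if sibling is not self:
                    started = sibling._start_next()
                    if started is not None:
                        break
        return started
```

In this sketch, a parent with a cap of 2 and two children behaves as the text describes: once one child holds both slots, a query submitted to the other child waits, and it starts as soon as the first child finishes a query with nothing queued of its own.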
Memory Control
Each group can define a soft memory limit. Queries that would exceed this limit are placed in a waiting queue. The parent group's limit is also taken into account, which allows memory to be shared across groups. Because a query's memory usage cannot be estimated precisely in advance, ByteHouse uses an estimate‑plus‑correction approach based on the memory_tracker.
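One way to picture the estimate‑plus‑correction idea is the sketch below. It is an illustration under stated assumptions, not ByteHouse code: the MemoryGroup class and its admit, correct, and release methods are hypothetical names, and the tracked value passed to correct simply stands in for whatever the memory_tracker reports at runtime.

```python
from typing import Iterator, Optional


class MemoryGroup:
    """Hypothetical soft-limit accounting for one group in a parent-child tree."""

    def __init__(self, soft_limit_bytes: int,
                 parent: Optional["MemoryGroup"] = None) -> None:
        self.soft_limit_bytes = soft_limit_bytes
        self.parent = parent
        self.used_bytes = 0  # bytes accounted to this group and its descendants

    def _chain(self) -> Iterator["MemoryGroup"]:
        # This group followed by each ancestor up to the root.
        group: Optional["MemoryGroup"] = self
        while group is not None:
            yield group
            group = group.parent

    def admit(self, estimated_bytes: int) -> bool:
        """Reserve the estimate if it fits under every limit in the chain.

        Returns False when the query should wait in the queue instead.
        """
        for group in self._chain():
            if group.used_bytes + estimated_bytes > group.soft_limit_bytes:
                return False
        for group in self._chain():
            group.used_bytes += estimated_bytes
        return True

    def correct(self, estimated_bytes: int, tracked_bytes: int) -> None:
        """Replace the initial estimate with the usage observed at runtime."""
        delta = tracked_bytes - estimated_bytes
        for group in self._chain():
            group.used_bytes += delta

    def release(self, accounted_bytes: int) -> None:
        """Return a finished query's accounted memory to the chain."""
        for group in self._chain():
            group.used_bytes -= accounted_bytes
```

With a parent limit of 100 bytes and two children limited to 80 bytes each, a second query estimated at 60 bytes is refused while the first holds a 50‑byte estimate. Once the first query's accounting is corrected down to 30 bytes, the same query is admitted.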
CPU Control
ByteHouse leverages the CPU controller of Linux cgroups. CPU time is allocated according to predefined cpu_shares values. The effective CPU proportion for a group is cpu_shares / sum(cpu_shares) when multiple groups are active, or 100% when only one group runs. Thus the usable CPU range is [cpu_shares / sum(cpu_shares), 100%], which guarantees a minimum CPU share for each group while keeping overall CPU utilization high.
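The arithmetic behind that range can be checked with a few lines of Python. The function names and the example groups etl and dashboard are made up for illustration. The sketch also assumes that, for the effective proportion, the sum runs over the groups that are busy at the moment, which is how relative cgroup CPU weights behave under contention.

```python
from typing import Dict, Iterable


def guaranteed_minimum(cpu_shares: Dict[str, int], group: str) -> float:
    """Lower bound of the range: the group's share when every group is busy."""
    return cpu_shares[group] / sum(cpu_shares.values())


def effective_fractions(cpu_shares: Dict[str, int],
                        active_groups: Iterable[str]) -> Dict[str, float]:
    """CPU fraction of each busy group, normalized over the busy groups only."""
    active = list(active_groups)
    total = sum(cpu_shares[g] for g in active)
    return {g: cpu_shares[g] / total for g in active}


# Hypothetical configuration: two groups with a 1:3 weighting.
shares = {"etl": 256, "dashboard": 768}

both_busy = effective_fractions(shares, ["etl", "dashboard"])
only_etl = effective_fractions(shares, ["etl"])

print(both_busy)                          # {'etl': 0.25, 'dashboard': 0.75}
print(only_etl)                           # {'etl': 1.0}
print(guaranteed_minimum(shares, "etl"))  # 0.25
```

Under these assumed weights the etl group can use anywhere between 25% and 100% of the CPU, depending on whether dashboard is competing for it.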
Impact of Resource Groups
Resource Groups dramatically improve query experience by protecting high‑priority workloads, reducing query latency variance, and preventing out‑of‑memory kills that could destabilize the cluster.
Full‑stack isolation for CPU, memory, and concurrency.
Limiting the impact of low‑priority queries.
Mitigating the adverse effects of heavy write statements.
Performance tests in ByteDance's advertising business show average query latency dropping from 2.3‑14.1 seconds (pre‑deployment) to 0.4‑1.7 seconds (post‑deployment).
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.