
Performance Optimizations in RaftKeeper v2.1.0: Benchmark Results and Engineering Details

The article presents a detailed engineering analysis of RaftKeeper v2.1.0, highlighting benchmark‑driven performance improvements such as 11% write throughput gains, up to 198% faster read‑write mixed workloads, and multiple optimizations—including response serialization, list‑request handling, system‑call reduction, thread‑pool redesign, and asynchronous snapshot processing—validated on large ClickHouse clusters.


RaftKeeper is a high‑performance distributed consensus service compatible with ZooKeeper, now applied at scale in ClickHouse to overcome ZooKeeper bottlenecks and usable with other big‑data components like HBase. Version v2.1.0 introduces asynchronous snapshot creation and a series of performance‑focused engineering improvements.

Performance Test Results: Using the raftkeeper-bench tool on a three-node cluster (16 CPU cores, 32 GB RAM, 100 GB storage per node), RaftKeeper v2.1.0 outperformed v2.0.4 by 11% on pure create operations and achieved 143% higher throughput than ZooKeeper. In a mixed workload (create 1%, set 8%, get 45%, list 45%, delete 1%), it delivered 118% and 198% improvements over v2.0.4 and ZooKeeper respectively, with lower average latency (avgRT) and TP99.

1. Response Serialization Parallelization: The original single-threaded ResponseThread serialized all responses, consuming ~33% of its CPU time. Moving serialization into the IO threads parallelizes this work and reduces latency. Profiling also revealed heavy time spent in sdallocx_default (the jemalloc free path) inside a mutex-protected queue, prompting a strategy of releasing response memory before popping.

/// responses_queue is a mutex-protected sync queue; releasing the memory held by
/// response_for_session before calling tryPop moves the jemalloc free (sdallocx)
/// out of the critical section and shortens the time the lock is held
responses_queue.tryPop(response_for_session, std::min(max_wait, static_cast<UInt64>(1000)));

Benchmarking showed that with a concurrency level of 10, TPS increased by 31% and AvgRT dropped by 32% after this change.
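The pattern generalizes to any mutex-protected queue of heap-heavy objects. The sketch below is a simplified illustration, not RaftKeeper's actual code: SyncQueue is a hypothetical stand-in for responses_queue (without the timeout parameter), and responseLoop shows where the pre-pop release goes.

```cpp
#include <memory>
#include <mutex>
#include <queue>
#include <vector>

/// Hypothetical mutex-protected queue, standing in for RaftKeeper's responses_queue.
template <typename T>
class SyncQueue
{
public:
    void push(T value)
    {
        std::lock_guard<std::mutex> lock(mutex);
        queue.push(std::move(value));
    }

    bool tryPop(T & out)
    {
        std::lock_guard<std::mutex> lock(mutex);
        if (queue.empty())
            return false;
        /// The caller released the old contents of `out` before this call,
        /// so no expensive deallocation happens while the lock is held.
        out = std::move(queue.front());
        queue.pop();
        return true;
    }

private:
    std::mutex mutex;
    std::queue<T> queue;
};

/// Usage pattern on the response thread:
inline void responseLoop(SyncQueue<std::shared_ptr<std::vector<char>>> & responses_queue)
{
    std::shared_ptr<std::vector<char>> response_for_session;
    while (true)
    {
        /// Release the previous response *before* tryPop so the (potentially
        /// expensive) jemalloc free runs outside the queue's critical section.
        response_for_session.reset();
        if (!responses_queue.tryPop(response_for_session))
            break;
        /// ... serialize and send response_for_session over the network ...
    }
}
```

The key point is ordering: the destructor of the previous response runs on the consumer thread before the lock is taken, so producers are never blocked behind a free.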

2. List-Request Optimization: List handling dominated the request-processor thread, with most CPU spent on per-string heap allocation and vector insertion. A compact string representation using separate data and offset buffers was introduced (see the CompactStrings design), reducing the CPU share from 5.46% to 3.37% and raising TPS from 458k/s to 619k/s while lowering TP99.

Before: read requests 14826483, write requests 0, Read RPS: 458433, Read MiB/s: 2441.74, TP99 1.515 msec
After:  read requests 14172371, write requests 0, Read RPS: 619388, Read MiB/s: 3156.67, TP99 0.381 msec
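The article does not show the CompactStrings implementation; the following is a minimal sketch of the idea as described: all bytes live in one contiguous data buffer and each element is addressed through an offsets array, so adding N child names costs O(1) amortized heap allocations instead of N. Names and details are illustrative, not RaftKeeper's actual class.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

/// Sketch of a compact string container: one shared byte buffer plus
/// end offsets, instead of a vector of individually allocated strings.
class CompactStrings
{
public:
    void add(std::string_view s)
    {
        data.insert(data.end(), s.begin(), s.end());
        offsets.push_back(data.size());
    }

    size_t size() const { return offsets.size(); }

    std::string_view operator[](size_t i) const
    {
        size_t begin = i == 0 ? 0 : offsets[i - 1];
        return {data.data() + begin, offsets[i] - begin};
    }

private:
    std::vector<char> data;      /// concatenated bytes of all strings
    std::vector<size_t> offsets; /// end offset of each string in `data`
};
```

Because lookups return string_view into the shared buffer, serializing a List response can walk the children without materializing any per-string copies.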

3. System-Call Reduction: Profiling with bpftrace identified excessive getsockname and getsockopt calls originating from logging code. Removing these calls eliminated unnecessary kernel-user transitions.

BPFTRACE_MAX_PROBES=1024 bpftrace -p 4179376 -e '
tracepoint:syscalls:sys_enter_* { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_* /@start[tid]/ {
    @time[probe] = sum(nsecs - @start[tid]);
    delete(@start[tid]);
    @cc[probe] = sum(1);
}
interval:s:10{ exit(); }
'

4. Thread-Pool Redesign: The request-processor thread spent >60% of its time waiting on condition variables. Eliminating the thread pool for read requests and processing them on a single thread improved TPS by 13%.

thread_size, tps, avgRT(µs), TP90(µs), TP99(µs), TP999(µs), failRate
Before: 200, 84416, 2407.0, 3800.0, 4500.0, 8300.0, 0.0
After:  200, 108950, 1846.0, 3100.0, 4000.0, 5600.0, 0.0
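The redesign can be sketched as a dispatch loop: reads are answered inline on the processor thread (no pool hand-off, no condition-variable wakeup), and only writes go to the consensus path. Request and the callbacks below are hypothetical simplifications, not RaftKeeper's real types.

```cpp
#include <queue>
#include <string>
#include <utility>

/// Hypothetical request type; RaftKeeper's real dispatch is more involved.
struct Request
{
    bool is_write;
    std::string payload;
};

/// Single-threaded processing loop: read requests are served directly from
/// the local DataTree, avoiding the condition-variable wait that dominated
/// the old thread-pool design; writes are forwarded to the Raft commit path.
template <typename OnRead, typename OnWrite>
void processRequests(std::queue<Request> & incoming, OnRead on_read, OnWrite on_write)
{
    while (!incoming.empty())
    {
        Request req = std::move(incoming.front());
        incoming.pop();
        if (req.is_write)
            on_write(req);   /// hand off to the consensus/commit pipeline
        else
            on_read(req);    /// answer inline on this thread
    }
}
```

With a read-heavy mix (90% get/list in the benchmark workload), keeping reads on one thread trades parallelism for the elimination of wakeup latency, which is the better deal here.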

5. Asynchronous Snapshot Creation: Snapshot generation was previously a blocking operation (≈180 s for 60 M entries). It is now split: the DataTree is copied in the foreground and serialized asynchronously, reducing user-visible latency to 4.5 s at the cost of ~50% extra memory. Vectorizing the copy with SSE intrinsics further cut copy time from 4.5 s to 3.5 s.

#include <emmintrin.h> /// SSE2 intrinsics
#include <cstring>

inline void memcopy(char * __restrict dst, const char * __restrict src, size_t n)
{
    auto aligned_n = n / 16 * 16;
    auto left = n - aligned_n;
    while (aligned_n > 0)
    {
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst),
                         _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));
        dst += 16;
        src += 16;
        aligned_n -= 16;
        /// compiler barrier: keeps the loop from being collapsed back into a memcpy call
        __asm__ __volatile__("" : : : "memory");
    }
    ::memcpy(dst, src, left); /// copy the trailing (< 16 byte) remainder
}
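The copy-then-serialize split itself can be sketched as follows. This is an illustrative simplification under assumed names: DataTree is modeled as an ordinary map and the serialization format as plain text, neither of which matches RaftKeeper's internals.

```cpp
#include <future>
#include <map>
#include <sstream>
#include <string>
#include <utility>

using DataTree = std::map<std::string, std::string>; /// stand-in for the real tree

/// Asynchronous snapshot creation: the comparatively fast in-memory copy
/// blocks request processing (the ~4.5 s step), while the slow serialization
/// runs on a background thread against the private copy.
inline std::future<std::string> createSnapshotAsync(const DataTree & tree)
{
    DataTree copy = tree; /// foreground step: the only part users wait for
    return std::async(std::launch::async, [snapshot = std::move(copy)]
    {
        std::ostringstream out; /// stand-in for the on-disk snapshot format
        for (const auto & [path, value] : snapshot)
            out << path << '=' << value << '\n';
        return out.str();
    });
}
```

The copy owns its data, so the live tree can keep mutating while serialization proceeds; the ~50% extra memory mentioned above is the price of that private copy.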

6. Snapshot Load Acceleration: The original two-step load (parallel disk read plus single-threaded tree reconstruction) took 180 s. Parallelizing the second step by assigning buckets of the two-level hash map to separate threads reduced load time to 99 s; subsequent lock and format optimizations brought it down to 22 s.
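Bucket-parallel reconstruction works because the first level of the two-level map partitions the key space: each thread owns one bucket outright, so no locking is needed. A hedged sketch, with hypothetical types (the real code reads nodes from snapshot files, not a vector):

```cpp
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Bucket = std::unordered_map<std::string, std::string>;

/// Rebuild the second-level maps in parallel: each worker thread scans the
/// loaded nodes and inserts only the keys that hash to its own bucket, so
/// no two threads ever write to the same sub-map.
inline std::vector<Bucket> rebuildBuckets(
    const std::vector<std::pair<std::string, std::string>> & nodes, size_t num_buckets)
{
    std::vector<Bucket> buckets(num_buckets);
    std::vector<std::thread> workers;
    std::hash<std::string> hasher;
    for (size_t b = 0; b < num_buckets; ++b)
    {
        workers.emplace_back([&, b]
        {
            for (const auto & node : nodes)
                if (hasher(node.first) % num_buckets == b) /// this thread owns bucket b
                    buckets[b].emplace(node.first, node.second);
        });
    }
    for (auto & w : workers)
        w.join();
    return buckets;
}
```

Each worker rescans the full node list here for simplicity; a production loader would instead route each node to its bucket's queue during the disk-read phase.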

Online Deployment Impact: In a production ClickHouse cluster with ~170 k QPS (mostly List requests), upgrading from ZooKeeper to RaftKeeper v2.0.4 degraded performance, but v2.1.0 delivered a substantial gain, confirming the effectiveness of the engineering optimizations.

Tags: backend, distributed systems, performance, optimization, benchmark, RaftKeeper
Written by JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.
