
Performance Optimizations and Benchmark Analysis of RaftKeeper v2.1.0

This article presents a detailed engineering analysis of RaftKeeper v2.1.0, covering the benchmark methodology, performance gains across create, mixed, and list workloads, and four major optimizations (parallel response serialization, list-request handling, system-call reduction, and thread-pool redesign), plus asynchronous snapshot processing. Together these changes deliver substantial throughput and latency improvements in large-scale ClickHouse deployments.

DataFunTalk

RaftKeeper is a high‑performance distributed consensus service compatible with Zookeeper, widely deployed in ClickHouse to alleviate Zookeeper bottlenecks and applicable to other big‑data components such as HBase. Version 2.1.0 introduces several new features, most notably asynchronous snapshot creation, and delivers significant performance improvements.

Performance Test Setup

Benchmarks were run with the raftkeeper-bench tool on a three‑node cluster (16 CPU cores, 32 GB RAM, 100 GB storage per node). Versions compared: RaftKeeper v2.1.0, RaftKeeper v2.0.4, and ZooKeeper 3.7.1, all using default configurations.

Test Results

Two test groups were executed. In the pure create workload (value size 100 bytes), RaftKeeper v2.1.0 outperformed v2.0.4 by 11 % and ZooKeeper by 143 %. In a mixed workload (create 1 %, set 8 %, get 45 %, list 45 %, delete 1 %), RaftKeeper v2.1.0 showed 118 % improvement over v2.0.4 and 198 % over ZooKeeper. Average response time (avgRT) and TP99 metrics were also superior.

1. Response Parallel Serialization

Flame-graph analysis revealed that the ResponseThread spent roughly one third of its CPU time on response serialization. Moving serialization into the IO threads, where it runs in parallel across connections, reduced latency. The snippet below shows the mutex-protected queue where a related locking issue originated:

/// responses_queue is a mutex-based sync queue; freeing the previous
/// response_for_session inside tryPop adds time spent holding the lock
responses_queue.tryPop(response_for_session, std::min(max_wait, static_cast<UInt64>(1000)));

The fix releases the previous response's memory before calling tryPop, so the deallocation happens outside the queue's lock.

2. List‑Request Optimization

List requests consumed most of the request‑processor thread's CPU because each result is a std::vector<std::string>, which requires one dynamic allocation per child name. A compact string representation using a contiguous data buffer and an offset array was introduced, reducing CPU usage from 5.46 % to 3.37 % and increasing TPS from 458 k/s to 619 k/s.

3. Elimination of Unnecessary System Calls

Profiling with bpftrace identified excessive getsockname and getsockopt calls originating from logging code:

const auto socket_name = sock.isStream() ? sock.address().toString() : sock.peerAddress().toString();
LOG_TRACE(log, "Dispatch event {} for {} ", notification.name(), socket_name);

Removing these calls reduced kernel‑user context switches and improved overall latency.

4. Thread‑Pool Redesign

Benchmarking showed that the request‑processor thread spent over 60 % of its time waiting on condition variables. Processing read requests in a single thread instead of a thread pool increased TPS by 13 % and lowered avgRT from 2407 µs to 1846 µs. The original pool‑based flow looked like this:

/// 1. Schedule read-request processing on the thread pool
for (RunnerId runner_id = 0; runner_id < runner_count; runner_id++) {
    request_thread->trySchedule([this, runner_id] {
        moveRequestToPendingQueue(runner_id);
        processReadRequests(runner_id);
    });
}
/// 2. Wait for all read requests to be processed
request_thread->wait();
/// 3. Process write requests in a single thread
processCommittedRequest(committed_request_size);

Snapshot Optimizations

Asynchronous snapshot creation copies the DataTree in the main thread and serializes it in the background, reducing blocking time for a 60 M‑entry snapshot from 180 s to 4.5 s (at the cost of ~50 % extra memory). Further vectorized copying using SSE intrinsics cut the copy time from 4.5 s to 3.5 s:

#include <cstddef>
#include <cstring>
#include <emmintrin.h> // SSE2 intrinsics

inline void memcopy(char * __restrict dst, const char * __restrict src, size_t n) {
    auto aligned_n = n / 16 * 16; // bytes copyable in whole 16-byte chunks
    auto left = n - aligned_n;    // trailing remainder
    while (aligned_n > 0) {
        // Unaligned 16-byte load/store per iteration
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst), _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));
        dst += 16; src += 16; aligned_n -= 16;
        // Compiler barrier: prevents the loop from being folded back into a plain memcpy
        __asm__ __volatile__("" : : : "memory");
    }
    ::memcpy(dst, src, left); // copy the remaining tail bytes
}

Snapshot loading was parallelized by distributing bucket‑wise DataTree reconstruction across threads, reducing load time from 180 s to 99 s, and later to 22 s after additional lock and format optimizations.

Production Impact

In a high‑traffic ClickHouse cluster (≈170 k QPS, dominated by list requests), upgrading from ZooKeeper to RaftKeeper v2.0.4 showed degraded performance, but RaftKeeper v2.1.0 delivered a substantial throughput boost and lower latency, confirming the effectiveness of the optimizations.

Conclusion

The engineering improvements in RaftKeeper v2.1.0—spanning serialization, request handling, system‑call reduction, thread‑pool redesign, and asynchronous snapshot processing—demonstrate how targeted performance engineering can dramatically enhance the scalability of distributed coordination services in large‑scale ClickHouse deployments.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
