
How Tencent’s CKV+ Redefines High‑Performance NoSQL KV Storage with DPDK

CKV+, Tencent’s next‑generation high‑performance NoSQL KV database compatible with Redis, Memcached and ASN, evolves from early CMEM architectures, adopts a multi‑tenant shared‑memory design built on Seastar, integrates DPDK for user‑space networking, and achieves up to ten million QPS with significant cost and latency improvements.

Tencent Architect

1. CKV Architecture Evolution

CKV (also called CKV+) is Tencent’s new-generation, self-developed, high-performance NoSQL KV database. It is compatible with the Redis, Memcached, and ASN protocols, and its combination of high performance, low latency, and low cost makes it well suited to massive data access and cost-sensitive scenarios.

It is widely used in services such as Guangdiantong, Xinge, QQ Music, Tenpay, Weishi, and Kankan.

CKV has since been optimized further along both the performance and cost dimensions.

Below we review CKV’s architectural evolution, analyze current bottlenecks, and present performance optimizations compared with open‑source Redis and competing products.

1.1 CMEM Architecture

CKV originated from a series of distributed memory systems:

2009: TMEM (Tencent Memcached) launched for QZone and Pengyou.

2011: Renamed CMEM (Cloud Memcached), supporting massive third‑party game users.

2012: Introduced cold‑data storage to lower memory cost.

2014: Supported WeChat Red Packet service.

2017: CKV built on Seastar, compatible with Redis protocol, further improving performance.

1.1.1 Architecture Overview

CMEM consists of core modules (access, storage, metadata) and peripheral modules (backup, migration, statistics, probing, recovery). Data is presharded into 10,000 slots (hash(key) % 10000), each mapped to a primary‑secondary shard pair.
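The pre-sharding scheme can be sketched in a few lines. Both the hash function (FNV-1a here) and the slot-to-shard-pair mapping are illustrative assumptions; the article does not specify CMEM’s actual choices:

```cpp
#include <cstdint>
#include <string>

// CMEM pre-shards data into 10,000 slots via hash(key) % 10000;
// each slot then maps to a primary-secondary shard pair.
constexpr int kSlots = 10000;
constexpr int kShardPairs = 50;  // hypothetical number of shard pairs

// FNV-1a stands in for CMEM's unspecified key hash.
uint64_t fnv1a(const std::string& key) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : key) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

int slot_of(const std::string& key) {
    return static_cast<int>(fnv1a(key) % kSlots);
}

// Toy slot-to-pair mapping: contiguous slot ranges share one primary/secondary pair.
int shard_pair_of(int slot) { return slot / (kSlots / kShardPairs); }
```

A lookup thus needs only the key to locate the responsible shard pair, which is what keeps migration impact confined to a small slot range.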

1.1.2 Advantages and Disadvantages

Advantages

Shared‑memory storage enables fast recovery on process failure and efficient memory use for uniform data lengths.

Supports open‑source memcached protocol.

Migration has minimal impact, affecting only delete operations.

Multi‑tenant design maximizes resource utilization.

Disadvantages

Many modules increase fault‑localization complexity.

Whole‑machine primary‑secondary design creates load imbalance.

rsync-based primary-secondary sync is single-threaded and becomes a bottleneck under high load.

The storage engine handles variable-length data less efficiently.

Multi‑process locking in storage limits CPU usage.

Metadata stored locally without replication; full‑push updates are inefficient.

1.2 CKV Architecture

CKV removes unnecessary CMEM modules, retaining essential core (storage, metadata) and peripheral (backup, OSS) modules. Metadata is stored in an ETCD service with at least three replicas for reliability.

The storage module handles access directly: if a request targets a local shard, it is processed in place; otherwise it is forwarded to the target node.

Peripheral modules include backup (cold backup and recovery) and OSS (cluster management, instance creation/deletion, scaling).

1.2.1 Core Architecture

CKV adopts a two‑layer design: storage and metadata management. Metadata resides in ETCD with a 1‑to‑1 mapping to the metadata service, providing three‑way data redundancy.

Storage uses a single‑process multi‑thread model; shards are allocated on demand, balancing primary‑secondary counts per node. Threads can also act as access layers.

Data model follows presharding (hash(key) % 16384), aligning slot count with Redis and supporting one‑master‑multiple‑backup.
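Since the slot count is aligned with Redis, the slot computation can be sketched the way Redis Cluster does it — CRC16 of the key modulo 16384. This is a minimal sketch assuming the Redis-style CRC-16/XMODEM variant; the article does not name CKV’s actual hash:

```cpp
#include <cstdint>
#include <string>

// CRC-16/XMODEM: polynomial 0x1021, initial value 0, MSB-first.
// Its check value for "123456789" is 0x31C3, matching Redis Cluster's CRC16.
uint16_t crc16(const std::string& s) {
    uint16_t crc = 0;
    for (unsigned char c : s) {
        crc ^= static_cast<uint16_t>(c) << 8;
        for (int i = 0; i < 8; ++i)
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
    }
    return crc;
}

// Pre-sharding: hash(key) % 16384, the same slot space Redis Cluster uses.
int slot_of(const std::string& key) { return crc16(key) % 16384; }
```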

1.2.2 Thread Model

The cache layer uses a peer-to-peer model in which each CPU runs one thread handling a set of shards. Each shard is managed by exactly one CPU, so shard access is lock-free with no locking overhead.

Requests are routed to the responsible thread; if not local, they are forwarded. Each thread also handles network I/O, performing packet decoding and routing, thus fully utilizing CPU cores.

Operations are implemented with future/promise asynchronous patterns, allowing I/O to proceed without blocking.
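The shard-ownership and forwarding scheme above can be modeled in miniature. The placement rule (slot % ncpus) is an assumption, and std::async stands in loosely for Seastar’s future/promise machinery — this is an illustration of the pattern, not CKV’s implementation:

```cpp
#include <future>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of the peer-to-peer thread model: each "cpu" exclusively owns
// the shards where slot % kNumCpus == cpu, so shard access needs no locks.
constexpr int kNumCpus = 4;

int owner_cpu(int slot) { return slot % kNumCpus; }  // assumed placement rule

// Per-cpu stores; only the owning "cpu" touches its map.
std::vector<std::unordered_map<std::string, std::string>> stores(kNumCpus);

void set(int slot, const std::string& key, const std::string& value) {
    stores[owner_cpu(slot)][key] = value;
}

// A non-local request is "forwarded" to the owner; here the forwarding is
// modeled as a deferred task whose result comes back through a future,
// mirroring the future/promise style of the real thread model.
std::future<std::string> get(int slot, const std::string& key) {
    int cpu = owner_cpu(slot);
    return std::async(std::launch::deferred, [cpu, key] {
        auto it = stores[cpu].find(key);
        return it == stores[cpu].end() ? std::string("(nil)") : it->second;
    });
}
```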

2 Performance Bottlenecks and Analysis

2.1 Performance Data

CKV performance on a 72‑core machine (Seastar‑based) shows significant improvements over CMEM and Redis.

2.2 CKV Performance Analysis

Flame graphs indicate that CPU consumption is concentrated in seastar::reactor::run_tasks (command execution and packet reception) and seastar::reactor::poll_once (packet transmission). mpstat shows soft interrupts consuming roughly 30% of CPU, primarily due to NIC packet reception.

2.3 Redis Performance Analysis

Under the same hardware, 72 Redis instances were benchmarked. CPU usage is also dominated by network I/O and soft interrupts, confirming that network stack is the primary bottleneck.

Replacing the kernel network stack with a user‑space stack (DPDK) can alleviate this limitation.

3 DPDK Integration

DPDK (Intel Data Plane Development Kit) provides user‑space packet processing, bypassing the kernel stack.

3.1 Initializing Seastar

Seastar initialization proceeds in four steps:

1. On CPU 0, call smp::configure to set up the configuration.

2. On CPU 0, call rte_eal_init to initialize the DPDK EAL.

3. Use rte_eal_remote_launch to configure the reactor and protocol stack on the non-zero CPUs.

4. After all CPUs finish, initialize the engine and protocol stack on CPU 0.
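The boot ordering can be modeled in miniature: CPU 0 performs the first two steps, worker threads stand in for the remote-launched CPUs, and the final engine initialization waits for all of them. This is an illustrative C++ sketch, not real Seastar/DPDK calls:

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Shared boot log; the mutex keeps concurrent worker logging well-defined.
std::vector<std::string> boot_log;
std::mutex log_mu;

void log_step(const std::string& s) {
    std::lock_guard<std::mutex> g(log_mu);
    boot_log.push_back(s);
}

void boot(int ncpus) {
    log_step("cpu0: smp::configure");    // step 1: config on CPU 0
    log_step("cpu0: rte_eal_init");      // step 2: DPDK EAL on CPU 0
    std::vector<std::thread> workers;    // step 3: "remote-launch" cpus 1..n-1
    for (int cpu = 1; cpu < ncpus; ++cpu)
        workers.emplace_back([cpu] {
            log_step("cpu" + std::to_string(cpu) + ": reactor+stack init");
        });
    for (auto& t : workers) t.join();    // wait for every worker to finish
    log_step("cpu0: engine+stack init"); // step 4 runs last, back on CPU 0
}
```

The essential invariant is the barrier: engine initialization on CPU 0 must not begin until every remote-launched CPU has brought up its reactor.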

3.2 Protocol Stack Initialization (DPDK)

Protocol stack initialization involves three main steps:

1. NIC initialization: retrieve device info, set isolation mode via rte_flow_isolate, then configure the NIC with rte_eth_dev_configure.

2. Queue initialization: set up the RX and TX queues using rte_eth_rx_queue_setup and rte_eth_tx_queue_setup, then poll them with rte_eth_rx_burst and rte_eth_tx_burst.

3. NIC start: call rte_eth_dev_start, create flow rules with rte_flow_create, and verify link status.

3.3 Packet Processing

RX and TX pollers are registered as reactor::poller::simple([&] { return poll_rx_once(); }) and reactor::poller::simple([this] { return poll_tx(); }). RSS distributes packets across the NIC queues so that every packet of a given flow lands on the same queue, keeping each flow affined to a single CPU.
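The poller registration amounts to a loop over callables that each report whether they did work, with the reactor idling once a full pass finds nothing to do. A minimal stand-in for that pattern, plus an RSS-style queue pick — both illustrative, not Seastar’s actual implementation:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Minimal stand-in for Seastar's reactor::poller::simple: a callable that
// returns true when it performed work on this pass.
struct Poller {
    std::function<bool()> poll;
};

// Loop over all pollers until a full pass does no work; returns pass count.
int run_until_idle(std::vector<Poller>& pollers) {
    int passes = 0;
    bool did_work = true;
    while (did_work) {
        did_work = false;
        for (auto& p : pollers)
            if (p.poll()) did_work = true;  // e.g. poll_rx_once() / poll_tx()
        ++passes;
    }
    return passes;
}

// RSS-style queue pick: the same flow hash always lands on the same queue,
// so a flow stays pinned to one queue and therefore one CPU.
int rss_queue(uint32_t flow_hash, int nqueues) { return flow_hash % nqueues; }
```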

Packets traverse from DPDK’s rte_mbuf buffers up through the L2, L3, and L4 layers of the user-space stack, which supports ARP, DHCP, ICMP, IP, TCP, and UDP. Each TCP connection is represented by a TCB object backed by lock-free queues, achieving up to 200 MB/s per connection.
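A lock-free single-producer/single-consumer ring is the kind of structure a TCB could use to hand segments between the stack and the application without locking. The following is an illustrative sketch, not CKV’s actual queue:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Lock-free SPSC ring: one producer thread calls push, one consumer thread
// calls pop; acquire/release ordering makes the handoff safe without locks.
// Capacity is N-1 because one slot is sacrificed to distinguish full/empty.
template <typename T, size_t N>
class SpscQueue {
    std::array<T, N> buf_;
    std::atomic<size_t> head_{0};  // next slot to read
    std::atomic<size_t> tail_{0};  // next slot to write
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

Because each TCP connection’s queues are touched by a fixed producer/consumer pair, this structure needs no mutex, which is what keeps per-connection throughput high.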

3.4 Isolation Mode and Flow Rules

DPDK’s isolate mode keeps most traffic in the kernel stack while directing selected flows to DPDK, preventing interference with system services.

4 Performance Results

Single‑shard six‑core test shows CKV write performance 60% higher than Tair, with comparable read performance.

Six‑shard six‑core scenario demonstrates a ~100% improvement when using DPDK versus kernel stack.

Full‑machine tests achieve up to ten million QPS after DPDK integration, confirming that network I/O was the dominant bottleneck.

Conclusion

Through performance‑focused optimizations and DPDK integration, CKV’s overall throughput reaches ten million QPS, with ongoing work to further reduce cost via SSD storage and continue enhancing the native protocol stack.
