Design and Performance Optimization of Twemproxy Using Nginx Multi‑Process Architecture
The project re‑engineers Twemproxy around Nginx’s master‑worker multi‑process model, adding a non‑blocking accept lock, SO_REUSEPORT load balancing, CPU affinity, and worker crash isolation. Together these changes turn the single‑threaded proxy into a scalable, low‑latency, high‑QPS solution for high‑concurrency public‑cloud workloads.
Development background: Existing open‑source cache‑proxy middleware such as Twemproxy and Codis has limitations. Twemproxy is single‑process and single‑threaded, and it only proxies standalone Memcached and Redis, with no cluster support. Its throughput is low (short‑connection QPS ≈ 30k, long‑connection QPS ≈ 130k) and it suffers from latency jitter.
To meet high‑concurrency demands on public‑cloud platforms, the project re‑engineers Twemproxy by borrowing Nginx’s high‑performance, high‑reliability, high‑concurrency mechanisms, introducing one master plus multiple worker processes for layer‑7 forwarding.
Twemproxy overview: Twemproxy is a fast, single‑threaded proxy written in C that supports the Memcached ASCII protocol and the Redis protocol. Its features include speed, a lightweight design, persistent server connections, request/response pipelining, multiple backends, server pools, consistent hashing, detailed monitoring, and cross‑platform compatibility.
Identified bottlenecks of native Twemproxy:
Single‑process, single‑threaded – cannot exploit multi‑core CPUs.
CPU usage >70% when short‑connection QPS reaches 8k, causing latency spikes.
IO blocking under high traffic.
High maintenance cost for scaling (multiple instances required).
Difficult to upgrade and scale.
These issues manifest as “one person doing the work, many watching” – only one CPU core is utilized while others stay idle.
Why adopt Nginx’s multi‑process model: Both Twemproxy and Nginx are network‑IO‑intensive, layer‑7 forwarding applications with strict latency requirements. Nginx efficiently utilizes multi‑core CPUs, offers low latency, and provides a mature master‑worker architecture.
Master‑worker mechanism: A master process manages a configurable number of worker processes. Workers handle client requests; the master monitors the workers, forwards signals to them, and restarts any worker that fails.
Key performance challenges and solutions:
Thundering‑herd problem on Linux kernels below 2.6: solved with a non‑blocking accept lock (trylock), so that only one worker holds the accept lock, and therefore calls accept(), at any moment.
Load balancing among workers: each worker tracks a max_threshold and a local_threshold; when its local_threshold exceeds max_threshold, the worker releases the accept lock so that less‑loaded workers can accept new connections.
Utilizing SO_REUSEPORT on Linux 3.9+: lets multiple processes bind the same IP/port, providing kernel‑level load balancing and eliminating accept‑lock contention.
Master‑worker communication: implemented via signal mechanisms and a channel based on socketpair, similar to Nginx’s IPC, to exchange configuration and monitoring data.
CPU affinity: workers are pinned to specific CPUs to reduce context‑switch overhead.
Worker crash isolation: if a worker crashes, only its share of traffic (~1/20 with 20 workers) is affected; the master instantly respawns a new worker.
Network optimization: profiling showed soft interrupts concentrated on a few CPU cores; RSS (Receive Side Scaling) was enabled to spread network interrupts across cores. In addition, TCP_QUICKACK is set via setsockopt() after each recv() to reduce latency spikes caused by delayed ACKs.
Performance comparison:
Before modification: a native Twemproxy cluster delivered 5‑6k QPS, and the latency distribution showed heavy jitter with 40 ms spikes.
After modification (master + workers, same worker count): latency roughly halved and stability improved, with no observable jitter.
Offline benchmark with reuseport on Linux 3.10 showed significant QPS gains for both 100‑byte and 150‑byte payloads on multi‑core machines.
Conclusions: The multi‑process redesign cuts latency by up to a factor of three and improves stability. A multi‑process architecture was chosen over multi‑threading for its lock‑free data paths, better fault isolation, and easier hot reload and upgrade.
Future plans: Add hot‑loadable configuration and code hot‑upgrade, and eventually build a generic high‑performance TCP proxy framework inspired by Nginx, supporting modular protocol extensions.
Didi Tech (official Didi technology account)