Design and Implementation of High‑Concurrency (C10M) Load Balancing in Alibaba's AGW Middlebox
The article analyzes the challenges of scaling network devices to handle ten‑million concurrent connections (C10M) and describes Alibaba's AGW solution, which uses lock‑free data planes, hugepages, NUMA‑aware memory placement, and user‑space NIC drivers to achieve high‑performance four‑layer load balancing.
With the rapid global expansion of the Internet, traffic to giant sites such as Google, Facebook, and China’s BAT companies has exploded, pushing the need for servers that can handle far more than the classic C10K problem.
Today the industry faces the C10M challenge, which is commonly interpreted as supporting 10 million concurrent connections, processing 10 million packets per second, creating 1 million new connections per second, and moving 10 Gbps of traffic.
Although individual figures are debatable (whether 10 Gbps is sufficient, for example), the C10M metric paints a clear picture of the scale required for future services.
MiddleBox devices—network appliances that transform, inspect, filter, or otherwise manipulate traffic—sit at the network edge and are the first point of contact for massive traffic bursts, making them prime candidates for C10M testing. Alibaba extensively uses MiddleBox technology, including 4/7‑layer load balancers and 4‑layer proxy gateways; this article focuses on the work done for the 4‑layer load balancers.
1. Lock‑free implementation: The data plane achieves lock‑free operation by assigning each CPU core its own session table and distributing NIC I/O streams across multiple cores for parallel processing, eliminating lock contention. The control plane stays lock‑free as well by applying VIP table updates through periodic polling rather than on‑demand commands. Together these measures address multi‑core scalability.
2. Hugepage usage: Dedicated 1 GB huge pages are allocated for the AGW gateway to reduce TLB misses and associated memory operations; a mempool mechanism is employed for frequently allocated data structures.
3. NUMA‑aware core data support: Because modern servers are built on NUMA architectures, session tables are placed in the memory of the NUMA node that hosts the corresponding CPU core, and NIC queues are bound to CPUs on that same node. This avoids cross‑node memory accesses, reduces bus bottlenecks and cache contention, and, together with hugepages, addresses memory scalability.
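The per‑core session table at the heart of points 1 and 3 can be sketched as follows. This is a minimal illustration, not AGW's actual code: every structure and function name here is hypothetical, and the flow hash is a toy mixer rather than the Toeplitz hash real NICs use for RSS. The key property is that each core indexes only its own table, so lookups and inserts need no locks.

```c
#include <stdint.h>
#include <stddef.h>

#define NR_CORES   4
#define TABLE_SIZE 1024  /* slots per core; power of two for cheap masking */

struct session {
    uint32_t saddr, daddr;   /* client / backend IPv4 addresses */
    uint16_t sport, dport;
    uint8_t  in_use;
};

/* One private table per core: no sharing between cores, hence no locks.
 * In a NUMA-aware build each row would be allocated on the node that
 * hosts its core; a static array keeps this sketch self-contained. */
static struct session session_table[NR_CORES][TABLE_SIZE];

static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    /* Illustrative mixing only, not the NIC's RSS Toeplitz hash. */
    uint32_t h = saddr * 2654435761u ^ daddr;
    h ^= ((uint32_t)sport << 16) | dport;
    return h;
}

/* Find-or-insert in this core's table using linear probing. */
static struct session *session_lookup_insert(int core, uint32_t saddr,
                                             uint32_t daddr, uint16_t sport,
                                             uint16_t dport)
{
    uint32_t idx = flow_hash(saddr, daddr, sport, dport) & (TABLE_SIZE - 1);
    for (int probe = 0; probe < TABLE_SIZE; probe++) {
        struct session *s =
            &session_table[core][(idx + probe) & (TABLE_SIZE - 1)];
        if (!s->in_use) {
            s->saddr = saddr; s->daddr = daddr;
            s->sport = sport; s->dport = dport;
            s->in_use = 1;
            return s;                 /* newly created session */
        }
        if (s->saddr == saddr && s->daddr == daddr &&
            s->sport == sport && s->dport == dport)
            return s;                 /* existing session */
    }
    return NULL;                      /* table full */
}
```

Because the NIC's RSS hashing steers all packets of a flow to the same queue, and each queue is bound to one core, a flow's session lives in exactly one core's table for its entire lifetime.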
During the C10K era, technologies such as epoll emerged, but the traditional Linux netfilter‑based four‑layer load‑balancing architecture struggles under C10M loads due to three main bottlenecks:
Network‑card interrupt handling: increasing traffic leads to longer interrupt processing times, degrading service performance.
Linux network stack complexity: the full kernel stack adds latency, prompting the need for a fast‑path packet processing path.
Multi‑core scalability: as CPU core counts rise, the kernel’s network stack does not scale linearly, relying heavily on locks and causing contention.
To overcome these issues, Alibaba’s network product team developed the AGW (AliGateWay) system, a user‑space solution that discards the Linux netfilter framework and implements full‑NAT plus SYN‑proxy functionality.
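The article does not describe AGW's syn‑proxy internals, but SYN proxies conventionally use SYN‑cookie‑style sequence numbers: the proxy answers the client's SYN with a SYN+ACK whose sequence number encodes the connection identity, and allocates session state only when the client's ACK echoes it back. A hedged sketch of that generic technique (the hash layout is illustrative; production code uses a keyed cryptographic hash):

```c
#include <stdint.h>

/* Derive the proxy's initial sequence number (ISN) from the connection
 * 4-tuple and a server secret, so the final ACK can be validated without
 * having stored any per-connection state when the SYN arrived. */
static uint32_t cookie_isn(uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport, uint32_t secret)
{
    /* Illustrative mixer only; a real proxy uses a keyed hash (e.g. SipHash). */
    uint32_t h = saddr ^ (daddr * 2654435761u) ^ secret;
    h ^= ((uint32_t)sport << 16) | dport;
    h ^= h >> 13;
    h *= 0x85ebca6bu;
    h ^= h >> 16;
    return h;
}

/* On SYN: reply SYN+ACK with seq = cookie_isn(...), keeping no state.
 * On the client's ACK: ack_seq - 1 must equal the cookie; only then is a
 * session created and the backend connection opened. */
static int cookie_valid(uint32_t saddr, uint32_t daddr, uint16_t sport,
                        uint16_t dport, uint32_t secret, uint32_t ack_seq)
{
    return ack_seq - 1 == cookie_isn(saddr, daddr, sport, dport, secret);
}
```

Because a spoofed SYN flood never completes the handshake, the proxy spends no session memory on it, which is what lets the gateway survive connection‑rate attacks at C10M scale.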
Key technologies of the AGW design include:
User‑space NIC driver built on the Linux UIO framework to bypass the kernel network stack, allowing the application to handle packets directly and addressing packet‑processing scalability.
Multi‑core, multi‑queue architecture: dedicated CPU cores are bound to NIC queues and processing threads; RSS and multi‑queue features distribute packets across cores, each maintaining its own session table.
Lock‑free data and control planes as described above, eliminating lock contention for both packet processing and VIP table updates.
Hugepage allocation (1 GB pages) to reduce TLB misses and improve memory allocation efficiency, complemented by a mempool for frequently used structures.
NUMA‑aware placement of core data structures (session tables) and binding of NIC queues to the same NUMA node, minimizing cross‑node memory traffic and enhancing cache utilization.
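The hugepage‑plus‑mempool combination from the list above can be sketched as follows. This is a simplified illustration (AGW's allocator is not public): the region is requested with MAP_HUGETLB and, if no huge pages are reserved on the host, falls back to normal pages so the sketch stays runnable; fixed‑size objects are then carved from the region into an intrusive freelist, so hot‑path alloc/free never touches malloc or the kernel.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define POOL_BYTES (2u << 20)  /* one 2 MB region (a single x86 huge page) */
#define OBJ_SIZE   256         /* fixed-size objects, e.g. session entries */

struct mempool {
    void *base;
    void *free_head;  /* intrusive freelist threaded through free objects */
};

static int mempool_init(struct mempool *mp)
{
    /* Prefer a huge-page mapping to cut TLB misses; fall back to normal
     * pages if none are reserved (vm.nr_hugepages = 0). */
    void *p = mmap(NULL, POOL_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        p = mmap(NULL, POOL_BYTES, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    mp->base = p;
    mp->free_head = NULL;
    /* Carve the region into OBJ_SIZE chunks and push them on the freelist. */
    for (size_t off = 0; off + OBJ_SIZE <= POOL_BYTES; off += OBJ_SIZE) {
        void *obj = (char *)p + off;
        *(void **)obj = mp->free_head;
        mp->free_head = obj;
    }
    return 0;
}

/* O(1) pop/push, no system calls; lock-free here because in the per-core
 * model each core owns its own pool. */
static void *mempool_alloc(struct mempool *mp)
{
    void *obj = mp->free_head;
    if (obj)
        mp->free_head = *(void **)obj;
    return obj;
}

static void mempool_free(struct mempool *mp, void *obj)
{
    *(void **)obj = mp->free_head;
    mp->free_head = obj;
}
```

In the NUMA‑aware variant described above, each core's pool would additionally be mapped from memory on that core's own node (e.g. via libnuma or mbind), so the fast path never crosses the interconnect.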
These combined techniques address packet, multi‑core, and memory scalability, enabling the system to handle C10M‑level traffic and providing robust support for Alibaba’s massive events such as Double‑11.