High‑Performance Network Solutions: RDMA, RoCE, iWARP and io_uring – Principles, Implementation and Benchmark Analysis
The article reviews high‑performance networking options—RDMA (including RoCE v2 and iWARP) and Linux’s io_uring—explaining their principles, hardware requirements, and benchmark results, and concludes that while RDMA delivers ultra‑low latency for specialized workloads, io_uring offers modest network benefits, leaving TCP as the default for most services.
Background
With NIC speeds increasing to 10G/25G/100G and latency requirements tightening to sub‑millisecond levels, the traditional Linux kernel network stack becomes a bottleneck for many latency‑sensitive services.
Although RDMA and DPDK have been adopted in some industry projects, most developers are still unfamiliar with these techniques.
This article briefly summarizes several high‑performance network solutions (RDMA, RoCE, iWARP, io_uring) from the perspectives of principle, feasibility and practical experience.
1. RDMA
(a) Principle
RDMA provides kernel bypass by offloading protocol processing to a dedicated NIC and exposing memory regions directly to user‑space applications, eliminating costly kernel mediation.
Key features
Zero‑Copy: DMA‑based data transfer without extra CPU copies.
Stable low latency: protocol processing in hardware yields consistent, predictable communication delay.
Multiple transport modes (RC, RD, UC, UD): RC provides reliable, connection-oriented semantics comparable to TCP, while UD provides unreliable datagram semantics comparable to UDP.
RDMA relies on the reliability of the underlying network; it typically avoids the complex reliability mechanisms found in TCP.
Implementation categories vary with the transport network (e.g., InfiniBand, RoCE).
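As a rough illustration, the typical user-space workflow with the verbs API (libibverbs) follows the sketch below. The call names are the real verbs functions, but this is pseudocode: connection setup, attributes and error handling are omitted, and actually running it requires an RDMA-capable NIC.

```
dev = ibv_open_device(...)            // open the RDMA NIC
pd  = ibv_alloc_pd(dev)               // protection domain
mr  = ibv_reg_mr(pd, buf, len,        // pin + register memory so the NIC
                 LOCAL_WRITE | REMOTE_WRITE)  // can DMA directly into it
cq  = ibv_create_cq(dev, depth, ...)  // completion queue
qp  = ibv_create_qp(pd, {send_cq: cq, recv_cq: cq, type: RC})
// exchange QP numbers and rkeys out of band, then drive the QP
// through INIT -> RTR -> RTS with ibv_modify_qp()
ibv_post_send(qp, wr /* e.g. RDMA_WRITE to remote addr + rkey */, &bad)
ibv_poll_cq(cq, 1, &wc)               // completion: data already placed remotely
```

Note that after setup, the data path (ibv_post_send / ibv_poll_cq) never enters the kernel, which is where the zero-copy and kernel-bypass properties come from.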
(b) RoCE v2 vs iWARP
Both are Ethernet‑based RDMA solutions. RoCE v2 encapsulates RDMA traffic in UDP/IP and requires a lossless Ethernet fabric (typically built with PFC/ECN, or vendor‑specific optimizations), while iWARP layers RDMA over standard TCP/IP and therefore tolerates packet loss, but has weaker ecosystem and hardware support in data‑center deployments.
2. io_uring / socket
(a) Principle
io_uring (Linux 5.1+) introduces a true proactor model with two shared queues: the Submission Queue (SQ) and Completion Queue (CQ). Applications enqueue SQEs, the kernel processes them, and CQEs are returned to user‑space.
Key advantages:
True asynchronous design (proactor) instead of the reactor‑style epoll.
Unified framework for storage, network and many system calls.
Support for advanced features such as file/buffer registration, automatic buffer selection, SQ polling, IO polling, multi‑shot operations, and batch submission/consumption.
(b) Benchmark setup
Kernel: Linux 5.15
Protocol: TCP echo
Server model: single‑threaded asynchronous
Client: multi‑threaded, each thread holds a persistent connection
Payload: 512 B per packet
Environment: loopback interface on the same machine
Tested models:
epoll (baseline)
io_uring – Proactor (direct recv/send)
io_uring – Reactor (socket fd registered with POLL_ADD; recv/send issued after readiness)
3. Test results and analysis
In network I/O, io_uring does not outperform epoll; the kernel network stack remains the primary bottleneck.
Enabling kernel‑side submission‑queue polling (SQPOLL) reduces latency under low load but yields little throughput gain at saturation, while the dedicated kernel polling thread consumes extra CPU cycles.
4. Business implications
For most backend services, TCP remains the stable choice; RDMA is viable only for a limited set of latency‑critical workloads.
Adopting io_uring for network I/O alone offers minimal benefit; however, combining io_uring for storage I/O with epoll for network I/O may reduce system‑call overhead.
Future work includes evaluating a unified io_uring approach for both network and storage paths.
Conclusion
RDMA, RoCE, iWARP and io_uring each have distinct advantages and constraints. RDMA excels in ultra‑low‑latency scenarios but requires specialized hardware and lossless networks. io_uring shines for storage workloads but provides limited gains for pure network I/O.
The next article will explore DPDK‑based solutions.
Author
Wang Meng – Tencent backend development engineer, responsible for C++ backend framework, with extensive experience in high‑concurrency development and performance tuning.
Recommended reading
Asynchronous Programming Guide
Redis Delay Queue in Go – High‑Performance Practice
Prometheus Monitoring Essentials
Tencent Cloud Developer