
High‑Performance Network Solutions: RDMA, RoCE, iWARP and io_uring – Principles, Implementation and Benchmark Analysis

This article reviews high‑performance networking options—RDMA (including RoCE v2 and iWARP) and Linux’s io_uring—explaining their principles, hardware requirements, and benchmark results, and concludes that while RDMA delivers ultra‑low latency for specialized workloads, io_uring offers modest network benefits, leaving TCP as the default for most services.

Tencent Cloud Developer

Background

With NIC speeds increasing to 10G/25G/100G and latency requirements tightening to sub‑millisecond levels, the traditional Linux kernel network stack becomes a bottleneck for many latency‑sensitive services.

Although RDMA and DPDK have been adopted in some industry projects, most developers are still unfamiliar with these techniques.

This article briefly summarizes several high‑performance network solutions (RDMA, RoCE, iWARP, io_uring) from the perspectives of principle, feasibility and practical experience.

1. RDMA

(a) Principle

RDMA provides kernel bypass by offloading protocol processing to a dedicated NIC and exposing memory regions directly to user‑space applications, eliminating costly kernel mediation.

Key features

Zero‑Copy: DMA‑based data transfer without extra CPU copies.

Stable low latency: protocol processing offloaded to hardware yields consistent, predictable communication delay.

Multiple transport modes (RC, UC, UD, and the rarely implemented RD) whose connection and reliability semantics loosely parallel TCP/UDP.

RDMA relies on the reliability of the underlying network; it typically avoids the complex reliability mechanisms found in TCP.

RDMA implementations are categorized by their underlying transport network (e.g., InfiniBand, RoCE, iWARP).

(b) RoCE v2 vs iWARP

Both are Ethernet‑based RDMA solutions. RoCE v2 requires lossless Ethernet (or vendor‑specific optimizations), while iWARP works over standard TCP/IP but is less widely supported in data‑center deployments.

2. io_uring / socket

(a) Principle

io_uring (Linux 5.1+) introduces a true proactor model with two shared queues: the Submission Queue (SQ) and Completion Queue (CQ). Applications enqueue SQEs, the kernel processes them, and CQEs are returned to user‑space.

Key advantages:

True asynchronous design (proactor) instead of the reactor‑style epoll.

Unified framework for storage, network and many system calls.

Support for advanced features such as file/buffer registration, automatic buffer selection, SQ polling, IO polling, multi‑shot operations, and batch submission/consumption.

(b) Benchmark setup

Kernel: Linux 5.15

Protocol: TCP echo

Server model: single‑threaded asynchronous

Client: multi‑threaded, each thread holds a persistent connection

Payload: 512 B per packet

Environment: loopback interface on the same machine

Tested models:

epoll (baseline)

io_uring – Proactor (direct recv/send)

io_uring – Reactor (socket fd registered via IORING_OP_POLL_ADD)

3. Test results and analysis

In network I/O, io_uring does not outperform epoll; the kernel network stack remains the primary bottleneck.

Enabling kernel SQ polling (SQPOLL) reduces latency under low load but yields little performance gain at saturation, while consuming extra CPU cycles.

4. Business implications

For most backend services, TCP remains the stable choice; RDMA is viable only for a limited set of latency‑critical workloads.

Adopting io_uring for network I/O alone offers minimal benefit; however, combining io_uring for storage I/O with epoll for network I/O may reduce system‑call overhead.

Future work includes evaluating a unified io_uring approach for both network and storage paths.

Conclusion

RDMA, RoCE, iWARP and io_uring each have distinct advantages and constraints. RDMA excels in ultra‑low‑latency scenarios but requires specialized hardware and lossless networks. io_uring shines for storage workloads but provides limited gains for pure network I/O.

The next article will explore DPDK‑based solutions.

Author

Wang Meng – Tencent backend development engineer, responsible for C++ backend framework, with extensive experience in high‑concurrency development and performance tuning.

