
How io_uring Integration Boosts Netpoll Throughput and Slashes Latency

This article examines the integration of Linux io_uring into ByteDance's high‑performance Netpoll NIO library, detailing the architectural changes, the receive and send workflows, the benchmarking methodology, and results showing over 10% higher throughput and 20‑40% lower latency, with system calls eliminated entirely in SQ‑polling mode.

ByteDance SYS Tech

Introduction

Netpoll is a high‑performance NIO network library that ByteDance developed on top of epoll, focused on RPC scenarios. Compared with Go's native net package, Netpoll gives applications greater control over the network layer, enabling optimizations before requests reach business logic.

Why io_uring?

io_uring, introduced in Linux 5.1, reduces system‑call counts through batched submission and completion, and provides a flexible asynchronous I/O framework that can handle many I/O types, improving scalability.

Integration Design

The integration replaces traditional epoll‑based I/O with io_uring for both receive and send paths. The design keeps Netpoll's poller‑context model (one poller per 20 CPUs) and adds dedicated uring rings: one receive uring and multiple send urings, each created with SQ‑poll threads.

Poller Contexts

During initialization, Netpoll creates a main server poller that accepts new connections and distributes them across a set of poller contexts, each handling a subset of connections.

Receive Flow

Incoming data triggers an EPOLLIN event on the connection; the poller allocates input buffers, performs recv, and dispatches a goroutine from the pool to run the user‑registered handler.

Send Flow

Applications call connection.Flush to transmit data. The send path uses sendmsg either directly in the user context or via a kernel SQ‑poll thread, depending on the configuration.

io_uring Model

Each io_uring instance consists of a Submission Queue (SQ), a Completion Queue (CQ), and a provided‑buffer ring (PBQ). Entries can be single‑shot (completing once) or multishot (staying armed across multiple completions); the latter is especially useful for receive operations.

Batching Differences

Both epoll and io_uring rely on vfs_poll, but io_uring processes batches inside the kernel via a task‑work chain, reducing latency and system‑call overhead.

Benchmark Setup

A Go echo server exchanging 1 KB messages was built with Netpoll using io_uring. The server ran on a 30‑CPU machine (19 CPUs for the application, the remainder for SQ‑poll threads). Tests covered 10–1000 concurrent connections at 50 million operations per run, measuring throughput, latency (P99, P9999), and system‑call counts.

Results

Throughput increased 10‑15% at more than 200 connections, while latency dropped 20‑40% (P99) and 10‑20% (P9999) compared with epoll. The system‑call count fell to zero in SQ‑polling mode and was up to 15× lower otherwise.

Conclusion

Integrating io_uring into Netpoll yields a measurable performance boost: >10% higher throughput, 20‑40% lower latency, and the elimination of system‑call overhead, achieving near‑zero system calls while preserving Netpoll's lightweight goroutine‑based design.

Tags: Go, io_uring, benchmark, high-performance networking, Netpoll
Written by

ByteDance SYS Tech

Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.
