Deep Dive into epoll: Principles, Blocking, and I/O Multiplexing
This article provides an in‑depth exploration of Linux’s epoll mechanism, covering its blocking behavior, kernel‑level processing, NAPI optimization, comparisons with select/poll, and practical insights into I/O multiplexing, helping backend engineers understand performance characteristics and design efficient network services.
1 Introduction
Epoll is an old but essential topic for backend engineers; many people study it, leading to varied understandings and misconceptions.
This article revisits epoll, focusing on thread blocking principles, interrupt optimization, NIC data handling, and the underlying mechanisms, and also critiques popular viewpoints.
2 Motivation
Before the main content, the author asks several questions about epoll performance, blocking vs non‑blocking, synchronous vs asynchronous I/O, and why select, poll, and epoll exist.
3 Getting Started with epoll
Epoll is the Linux kernel’s scalable I/O event notification mechanism, known for its superior performance. A benchmark from libevent compares select, poll, epoll, and kqueue: epoll’s response time remains stable as the number of sockets grows, while select and poll degrade.
The benchmark limits active connections to 100; epoll excels when many sockets are idle, but with many active sockets the advantage diminishes.
4 The Principles Behind epoll
4.1 Blocking
4.1.1 Why Blocking
Using a NIC as an example, the data‑receiving process consists of four steps: DMA write to memory, IRQ generation, kernel interrupt handling, and user‑space processing.
Because waiting for data to arrive takes milliseconds while the CPU executes instructions in nanoseconds, the process blocks instead of busy‑waiting, freeing the CPU for other work.
4.1.2 Blocking Does Not Consume CPU
Linux defines several process states; only processes in the runnable state compete for CPU. A blocked process consumes no CPU: it is removed from the scheduler’s run queue and parked on the socket’s wait queue until data arrives.
4.1.3 Unblocking
When data arrives, the kernel uses the packet’s destination port to locate the owning socket, takes the waiting process off that socket’s wait queue, changes its state to runnable, and the scheduler runs it later.
4.1.4 Process Model
This is essentially blocking I/O (BIO) where each process handles its own socket.
4.2 Optimizing Context Switches
Two sources of frequent context switches are NIC IRQ handling and per‑socket process wake‑ups.
4.2.1 NIC NAPI Mechanism
NAPI splits packet reception into a minimal hard‑IRQ part, which just calls napi_schedule and suppresses further interrupts, and a soft‑irq part (net_rx_action) that polls the driver and processes packets in batches.
The simplified flow: DMA write → IRQ → napi_schedule → soft‑irq → batch packet processing → user‑space.
4.2.2 Single‑Thread I/O Multiplexing
Kernel I/O multiplexing reduces per‑socket context switches by using a single thread to handle many sockets, similar to NAPI.
Select’s implementation uses an fd_set limited to FD_SETSIZE (1024) descriptors; epoll exposes an instance handle backed by a red‑black tree for registered descriptors (O(log n) insertion and removal) and a ready list, so retrieving ready events costs O(1) in the number of monitored sockets.
4.3 Evolution of I/O Multiplexing APIs
Comparison of select and epoll: select copies the fd_set between user and kernel space and performs an O(n) scan of it on every call, while epoll registers descriptors once and returns only ready events from a kernel‑maintained ready list.
Code snippets (shown as images) illustrate typical select and epoll usage.
4.4 Summary
Blocking improves CPU utilization: a process sleeps, consuming no CPU, while the kernel waits for data on its behalf.
I/O multiplexing (and NAPI) reduces context switches.
Three API generations (select, poll, epoll) evolve to improve kernel‑process interaction.
5 Diss Section
The author presents personal interpretations, arguing that all Linux I/O models are fundamentally synchronous and that “blocking vs non‑blocking” or “sync vs async” classifications can be misleading.
5.1 Classification of I/O Models
The author proposes two categories: the programmer‑oriented process model and the OS‑oriented I/O multiplexing model, with Reactor (Java NIO) and Proactor (Java AIO) as user‑space dispatch patterns.
5.2 About mmap
The author clarifies that epoll does not use mmap to share memory between kernel and user space; stracing a demo program confirms this.
END
Qunar Tech Salon