Comprehensive Overview of epoll Data Structures and Implementation in Linux
This article systematically explains epoll's core data structures—including a red‑black tree and a doubly linked ready list—its three main APIs, the kernel implementation details, thread‑safety mechanisms, ET vs. LT behavior, and how it improves over select/poll for high‑concurrency network programming.
Recently, candidates interviewing at top internet companies such as Tencent, Meituan, Pinduoduo, Youzan, and Xiyin have encountered several critical interview questions about epoll, including its data structures, implementation principles, interaction with the protocol stack, thread‑safety locking, and the differences between edge‑triggered (ET) and level‑triggered (LT) modes.
1. epoll Data Structures
Operating environment: epoll sits between the application and the kernel protocol stack; it depends on both the protocol stack and the VFS layer being present.
The core data structures of epoll consist of one red‑black tree and one doubly linked list, along with three primary APIs.
1.1 Red‑Black Tree
Array lookup is fast but insertion/deletion are O(n).
Binary search trees have O(log n) lookup but can degrade to O(n) if unbalanced.
B+ trees are optimal for disk‑based storage but unnecessary for in‑memory management of tens of thousands of file descriptors.
Because epoll manages up to tens of thousands of file descriptors in memory, a red‑black tree provides balanced O(log n) performance for insert, delete, and search operations, making it the ideal choice.
1.2 Ready‑Socket List – Doubly Linked List
The ready list stores sockets that are ready for I/O and must support fast insertion and deletion. When a socket is registered via `epoll_ctl`, a single epitem node serves as a member of both the red‑black tree and the ready list, so moving a descriptor onto or off the ready list is an O(1) list operation that requires no extra allocation.
1.3 Three Core APIs
`int epoll_create(int size)` — Creates an epoll instance and returns a file descriptor (epfd), allocating the red‑black tree and ready list. The `size` argument is ignored in modern kernels (it must merely be greater than zero).
`int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)` — Registers, modifies, or removes a monitored file descriptor. It also registers a callback with the kernel so that when the descriptor becomes ready, the callback adds the node to the ready list.
`int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout)` — Blocks until at least one monitored descriptor is ready, copies the ready events to user space, and returns the number of events.
2. Implementation Principles
Why epoll? It provides an efficient event‑driven I/O model for handling massive concurrent connections, reducing kernel‑user data copies, and supporting both edge‑triggered and level‑triggered modes.
Limitations of select/poll: select is capped at FD_SETSIZE (1024) descriptors; poll removes that hard cap, but both copy the entire descriptor set between user and kernel space on every call and require a linear scan to discover which descriptors are ready.
2.1 epoll Operations
When epoll_create is called, the kernel allocates an `eventpoll` structure containing a red‑black tree (`rbr`) for all monitored descriptors and a doubly linked list (`rdllist`) for ready events.
Each monitored descriptor is represented by an `epitem` structure, which is linked into both the red‑black tree and the ready list. The kernel registers a callback (`ep_poll_callback`) that moves an epitem to the ready list when the associated event occurs.
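The registration‑and‑callback flow can be sketched in abridged pseudocode. The function names follow `fs/eventpoll.c`, but the bodies here are heavily simplified and are not compilable kernel code:

```
/* epoll_ctl(EPOLL_CTL_ADD): link the new epitem and hook the file's wait queue */
ep_insert(ep, event, file, fd):
    epi = allocate epitem for (file, fd)
    ep_rbtree_insert(ep, epi)            /* link into ep->rbr */
    add_wait_queue(file's wait queue,
                   entry whose wake-up function is ep_poll_callback)

/* Invoked by the protocol stack when the descriptor becomes ready */
ep_poll_callback(wait_entry):
    spin_lock(ep->lock)
    if epi is not already on ep->rdllist:
        list_add_tail(epi->rdllink, ep->rdllist)   /* enqueue ready node */
    wake_up(ep->wq)                                /* wake epoll_wait sleepers */
    spin_unlock(ep->lock)
```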
2.2 I/O Multiplexing Modes
Level‑Triggered (LT): The kernel notifies the application as long as the descriptor remains ready, ensuring repeated notifications until the data is fully processed.
Edge‑Triggered (ET): The kernel notifies the application only on state changes; if the application does not drain the socket completely, no further notifications are sent until another event occurs.
LT is simpler to program, while ET offers higher efficiency by avoiding redundant notifications.
3. Protocol Stack Interaction with epoll
Communication between the protocol stack and epoll is asynchronous and decoupled. Notifications occur when a TCP three‑way handshake completes (generating `EPOLLIN` on the listening socket) or when data arrives and an ACK is sent.
4. Thread‑Safety and Locking
Red‑black tree modifications are protected by a mutex, while the ready list uses a spinlock because its operations are short and frequent. The callback `ep_poll_callback` acquires the spinlock to safely add nodes to the ready list.
4.1 Waiting Queue Implementation
Waiting queues consist of a head (`wait_queue_head_t`) and members (`wait_queue_t`). Each epitem registers a wait‑queue entry with the target file's wait queue via `add_wait_queue`, specifying `ep_poll_callback` as the wake‑up function.
4.2 Core Structures

```c
struct epitem {
    struct rb_node      rbn;      /* red-black tree node */
    struct list_head    rdllink;  /* ready-list link */
    struct epoll_filefd ffd;      /* associated file descriptor */
    int                 nwait;    /* number of wait-queue registrations */
    struct list_head    pwqlist;  /* poll wait-queue list */
    struct eventpoll    *ep;      /* back-pointer to owning eventpoll */
    struct epoll_event  event;    /* user-specified event mask */
};

struct eventpoll {
    spinlock_t          lock;      /* protects internal data */
    struct mutex        mtx;       /* serializes epoll_ctl/epoll_wait */
    wait_queue_head_t   wq;        /* wait queue for epoll_wait */
    wait_queue_head_t   poll_wait; /* wait queue for poll on the epoll fd */
    struct list_head    rdllist;   /* ready list */
    struct rb_root      rbr;       /* red-black tree of epitems */
    struct epitem       *ovflist;  /* temporary overflow list */
    struct user_struct  *user;     /* per-user limits */
};
```

5. ET vs. LT Implementation
ET mode triggers the callback only once per state change, while LT re‑adds the epitem to the ready list after each epoll_wait if the descriptor remains ready, ensuring continuous notifications.
6. Source Code Highlights
The kernel implementation resides in `fs/eventpoll.c`.
Key functions include `sys_epoll_create1`, `ep_insert`, `ep_ptable_queue_proc`, `ep_poll_callback`, `ep_poll`, and `ep_scan_ready_list`. These functions coordinate the allocation of eventpoll and epitem structures, registration of poll callbacks, sleeping and waking of waiting processes, and copying of ready events to user space. Overall, epoll combines a balanced red‑black tree for efficient descriptor management with a lock‑protected ready list to achieve high‑performance, scalable I/O multiplexing suitable for modern server applications.