Backend Development

Understanding epoll: Level‑Triggered vs Edge‑Triggered Modes and Kernel Implementation

This article explains the concepts of level‑triggered (LT) and edge‑triggered (ET) event notification in Linux epoll, describes epoll's core data structures such as the red‑black tree and ready‑list, details the three main APIs (epoll_create, epoll_ctl, epoll_wait), and examines the locking and internal kernel mechanisms that enable efficient I/O multiplexing.


Level‑triggered (LT) and edge‑triggered (ET) are two common trigger modes in event‑driven programming, especially when using the epoll system call for I/O multiplexing.

Level‑triggered (LT): The event is reported as long as the file descriptor remains ready; the kernel notifies the application repeatedly until the condition is cleared.

Edge‑triggered (ET): The event is reported only when the readiness state changes; the kernel notifies the application once and expects it to drain the data completely.

LT is simpler to program but may cause many redundant notifications.

ET reduces unnecessary wake‑ups and is preferred for high‑performance asynchronous I/O.

1. epoll Data Structures

epoll is implemented in the kernel around a dedicated eventpoll structure that contains a red‑black tree for managing all monitored file descriptors and a doubly‑linked list serving as the ready queue. The simplified structures below model this design in user space, using pthread primitives in place of the kernel's locks and wait queues:

struct epitem {
    RB_ENTRY(epitem) rbn;          // node in the red‑black tree
    LIST_ENTRY(epitem) rdlink;    // node in the ready list
    int rdy;      // exists in ready list
    int sockfd;
    struct epoll_event event;
};

struct eventpoll {
    rb_root rbr;        // red‑black tree root
    int rbcnt;
    LIST_HEAD(, epitem) rdlist; // ready list
    int rdnum;
    int waiting;
    pthread_mutex_t mtx;   // protects the red‑black tree
    pthread_spinlock_t lock; // protects the ready list
    pthread_cond_t cond;   // condition variable for waiting threads
    pthread_mutex_t cdmtx; // mutex for the condition variable
};

Each monitored descriptor has an epitem that is linked both to the red‑black tree and to the ready list, allowing O(log N) lookup, insertion, and deletion.

2. Core epoll APIs

int epoll_create(int size) – creates an eventpoll instance and returns a file descriptor (the size argument is ignored in modern kernels).

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event) – adds a descriptor to, modifies it in, or removes it from the epoll interest set. For an add, it registers a poll callback, inserts the epitem into the red‑black tree, and, if the descriptor is already ready, places it on the ready list.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout) – blocks until at least one monitored descriptor becomes ready, then copies ready events to user space.

3. epoll_wait Implementation

The system call validates arguments, obtains the eventpoll object, and calls ep_poll():

SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
                int, maxevents, int, timeout)
{
    int error;
    struct file *file;
    struct eventpoll *ep;
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
        return -EFAULT;
    file = fget(epfd);
    if (!file)
        return -EBADF;
    if (!is_file_epoll(file))
        return -EINVAL;
    ep = file->private_data;
    error = ep_poll(ep, events, maxevents, timeout);
    fput(file);
    return error;
}

ep_poll() converts the timeout to jiffies, checks the ready list, and, if it is empty, puts the calling thread to sleep on ep->wq. The thread is awakened by the poll callback ep_poll_callback() when a monitored descriptor becomes ready.

4. Poll Callback and Event Delivery

The callback registered by ep_ptable_queue_proc() adds a wait‑queue entry that points to ep_poll_callback(). When the kernel notifies the descriptor, ep_poll_callback() moves the corresponding epitem to the ready list and wakes any thread waiting in epoll_wait.

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;
    spin_lock(&ep->lock);
    if (epi->event.events & ~EP_PRIVATE_BITS) {
        if (!ep_is_linked(&epi->rdllink))
            list_add_tail(&epi->rdllink, &ep->rdllist);
        wake_up_locked(&ep->wq);
    }
    spin_unlock(&ep->lock);
    return 1;
}

When epoll_wait wakes, ep_send_events() copies events from the ready list to the user‑provided buffer. For each epitem, it re‑invokes the file's poll() method to obtain the current event mask, writes the mask and user data with __put_user, and, if the descriptor is in LT mode, re‑queues the epitem for the next epoll_wait call.

5. Locking Strategy

The red‑black tree is protected by a mutex (ep->mtx) because insert, delete, and modify operations may be performed concurrently by multiple threads.

The ready list is protected by a spinlock (ep->lock) since the operations are short and occur in both interrupt context (the callback) and process context.

Waiting threads sleep on ep->wq (a wait‑queue) and are awakened without holding the spinlock.

6. ET vs LT in Code

When an event is delivered, ep_send_events_proc() checks the mode:

if (epi->event.events & EPOLLONESHOT)
    epi->event.events &= EP_PRIVATE_BITS; // one‑shot clears interest
else if (!(epi->event.events & EPOLLET))
    list_add_tail(&epi->rdllink, &ep->rdllist); // LT re‑queues
// ET does not re‑queue; the next notification occurs only after a state change

Thus, ET notifications occur only on state transitions, while LT notifications continue until the application drains the descriptor.

7. Summary

epoll provides an efficient, scalable I/O multiplexing mechanism by combining a red‑black tree for fast descriptor management with a ready‑list for immediate event delivery. Its three‑call API (create, ctl, wait) supports both level‑triggered and edge‑triggered modes, and careful locking (mutex for the tree, spinlock for the ready list) ensures correctness in multi‑threaded environments.

Tags: system programming, I/O multiplexing, Linux kernel, event-driven, epoll, edge-triggered, level-triggered
Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
