
Mastering epoll: Deep Dive into Linux I/O Multiplexing

This article thoroughly examines Linux's epoll mechanism, detailing its SLAB memory management, middle‑layer design, edge and level triggering, comparison with select/poll, and related advanced polling technologies such as /dev/poll and kqueue, while also discussing C10K/C10M challenges and practical solutions.

Xiaokun's Architecture Exploration Notes

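Supplementary Notes on epoll

1. SLAB Memory Management

Characteristics of SLAB memory management

Objects of the same type, such as epitem/eventpoll, are stored in contiguous memory, which avoids memory fragmentation.

Freed epitem/eventpoll objects go back into an object pool and are reused, which reduces the cost of repeated creation and destruction.

The allocation works as follows:

Source code for epoll object creation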
<code>// eventpoll.c
ep = kzalloc(sizeof(*ep), GFP_KERNEL);

// slab.h
static inline void *kzalloc(size_t size, gfp_t gfp)
{
    return kmalloc(size, gfp | __GFP_ZERO);
}

/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;

/* Slab cache used to allocate "struct eppoll_entry" */
static struct kmem_cache *pwq_cache __read_mostly;
</code>

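epoll creates its objects through the SLAB mechanism, which avoids fragmentation and improves performance through object pooling. In the excerpt above, the eventpoll structure itself comes from kzalloc (served by the general-purpose kmalloc caches), while epitem and eppoll_entry each have a dedicated kmem_cache.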
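The object-pool idea can be sketched in user space. This is only an illustration of the reuse pattern, not the kernel's slab allocator: `struct item`, `struct item_pool`, `POOL_SIZE`, and the three `pool_*` functions are hypothetical names invented for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in for a fixed-size kernel object such as struct epitem. */
struct item {
    int fd;
    struct item *next_free; /* links free objects together */
};

#define POOL_SIZE 4 /* assumed capacity, chosen only for the example */

/* Hypothetical pool: one contiguous array plus a free list threaded through it. */
struct item_pool {
    struct item slots[POOL_SIZE];
    struct item *free_list;
};

/* Put every slot on the free list; slots[0] ends up at the head. */
void pool_init(struct item_pool *p)
{
    p->free_list = NULL;
    for (int i = POOL_SIZE - 1; i >= 0; i--) {
        p->slots[i].next_free = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Pop an object from the free list; returns NULL when the pool is exhausted. */
struct item *pool_alloc(struct item_pool *p)
{
    struct item *it = p->free_list;
    if (it == NULL)
        return NULL;
    p->free_list = it->next_free;
    it->fd = -1;
    it->next_free = NULL;
    return it;
}

/* Push a released object back so the next allocation can reuse it. */
void pool_free(struct item_pool *p, struct item *it)
{
    it->next_free = p->free_list;
    p->free_list = it;
}
```

Because all objects live in one array of equal-sized slots, releasing and reallocating never splits or merges memory, and an allocation immediately after a free returns the same slot.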

2. epoll Design Philosophy

epoll adopts a middle-layer design.

Partial source code of the eventpoll and epitem structures

<code>struct eventpoll {
    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;
    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;
    /* List of ready file descriptors */
    struct list_head rdllist;
    /* Lock which protects rdllist and ovflist */
    rwlock_t lock;
    /* RB tree root used to store monitored fd structs */
    struct rb_root_cached rbr;
    /* Single linked list of epitem that happened while transferring ready events */
    struct epitem *ovflist;
};

struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };
    /* List header used to link this structure to the eventpoll ready list */
    struct list_head rdllink;
    /* The file descriptor information this item refers to */
    struct epoll_filefd ffd;
    /* The "container" of this item */
    struct eventpoll *ep;
    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;
    /* The structure that describes the interested events and the source fd */
    struct epoll_event event;
};
</code>

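epoll uses a middle layer that binds each monitored socket to an epitem. The epitems are indexed in a red-black tree (rbr), ready items are queued on the ready list (rdllist), and items that become ready while events are being transferred to user space are chained on the singly linked ovflist.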
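From user space, the middle layer is reached through three calls: `epoll_create1` creates the eventpoll instance, `epoll_ctl(EPOLL_CTL_ADD)` creates an epitem for a descriptor, and `epoll_wait` returns the items on the ready list. The following is a minimal sketch that uses a pipe in place of a socket; the helper name `epoll_pipe_demo` is hypothetical.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the pipe's read end is reported
 * readable, 0 if it is not, and -1 on error. */
int epoll_pipe_demo(void)
{
    int fds[2];
    int epfd;
    int result = -1;
    struct epoll_event ev;
    struct epoll_event out[4];

    if (pipe(fds) < 0)
        return -1;

    epfd = epoll_create1(0);            /* kernel side: one struct eventpoll */
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    ev.events = EPOLLIN;                /* level-triggered by default */
    ev.data.fd = fds[0];

    /* kernel side: one struct epitem inserted into the red-black tree */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
        write(fds[1], "x", 1) == 1) {
        /* kernel side: the epitem is now on the ready list */
        int n = epoll_wait(epfd, out, 4, 1000);
        if (n >= 0)
            result = (n == 1 && out[0].data.fd == fds[0] &&
                      (out[0].events & EPOLLIN)) ? 1 : 0;
    }

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Calling `epoll_pipe_demo()` from `main` on a Linux system should return 1, since the single byte written makes the read end readable before `epoll_wait` is called.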

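3. Other Technical Points of epoll

Edge triggering and level triggering

Edge-triggered: an event fires when the state changes, that is, when new data arrives in the socket buffer. Level-triggered: the descriptor keeps being reported as readable for as long as the buffer is non-empty.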

<code>// Level-triggered is the default; EPOLLET selects edge-triggered mode, EPOLLONESHOT disables the item after one report
list_for_each_entry_safe(epi, tmp, head, rdllink) {
    if (esed->res >= esed->maxevents)
        break;
    // Handle the wakeup source
    ws = ep_wakeup_source(epi);
    if (ws) {
        if (ws->active)
            __pm_stay_awake(ep->ws);
        __pm_relax(ws);
    }
    // Remove the epitem from the ready list
    list_del_init(&epi->rdllink);
    // Poll the item again to collect its ready events
    revents = ep_item_poll(epi, &pt, 1);
    if (!revents)
        continue;
    // Copy the ready event to user space
    if (__put_user(revents, &uevent->events) ||
        __put_user(epi->event.data, &uevent->data)) {
        list_add(&epi->rdllink, head);
        ep_pm_stay_awake(epi);
        if (!esed->res)
            esed->res = -EFAULT;
        return 0;
    }
    esed->res++;
    uevent++;
    if (epi->event.events & EPOLLONESHOT)
        epi->event.events &= EP_PRIVATE_BITS;
    else if (!(epi->event.events & EPOLLET)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);
    }
}
#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
</code>

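With level triggering, an item is put back on the ready list after it is reported, so every call to epoll_wait polls it again and reports it for as long as data remains unread. With edge triggering, the item is not re-queued and is reported again only when new data arrives.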
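The difference can be observed from user space. The sketch below writes one byte to a pipe and then calls `epoll_wait` twice without ever reading the data; the helper name `count_reports` and the 100 ms timeout are assumptions made for this example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers a pipe's read end with EPOLLIN plus
 * extra_flags (0 for level-triggered, EPOLLET for edge-triggered),
 * writes one byte, then calls epoll_wait twice WITHOUT reading.
 * Returns how many of the two calls reported the descriptor, or -1 on error. */
int count_reports(unsigned int extra_flags)
{
    int fds[2];
    int epfd;
    int reports = -1;
    struct epoll_event ev;

    if (pipe(fds) < 0)
        return -1;

    epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = fds[0];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
        write(fds[1], "x", 1) == 1) {
        reports = 0;
        for (int i = 0; i < 2; i++) {
            struct epoll_event out;
            int n = epoll_wait(epfd, &out, 1, 100); /* 100 ms timeout */
            if (n < 0) {
                reports = -1;
                break;
            }
            reports += n; /* n is 0 (timeout) or 1 (reported) */
        }
    }

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return reports;
}
```

On Linux, `count_reports(0)` yields 2 because the unread byte keeps the descriptor ready, while `count_reports(EPOLLET)` yields 1 because no new data arrives after the first report. This is why edge-triggered code normally reads in a loop until `EAGAIN` on a non-blocking descriptor.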

Advanced polling techniques

/dev/poll
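<code>struct dvpoll {
    struct pollfd *dp_fds; // array that receives the ready pollfd entries
    nfds_t dp_nfds;        // number of entries the array can hold
    int dp_timeout;        // timeout in milliseconds
};

wfd = open("/dev/poll", O_RDWR, 0);
write(wfd, pollfd, MAX_SIZE); // pollfd is an array of struct pollfd; the third argument is its size in bytes
ioctl(wfd, DP_POLL, &dvpoll);
</code>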

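/dev/poll provides scalable polling on Solaris: the set of file descriptors is registered once up front (the write call), and the program then waits for events in a loop (the DP_POLL ioctl).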

kqueue
<code>// Returns a new kqueue descriptor
int kqueue(void);

// Register events and/or retrieve pending events
int kevent(int kq,
           const struct kevent *changelist, int nchanges,
           struct kevent *eventlist, int nevents,
           const struct timespec *timeout);

// Initialize a kevent structure (EV_SET is a macro, shown here in prototype form)
void EV_SET(struct kevent *kev, uintptr_t ident, short filter,
            u_short flags, u_int fflags, intptr_t data, void *udata);

// The kevent structure
struct kevent {
    uintptr_t   ident;
    short       filter;
    u_short     flags;
    u_int       fflags;
    intptr_t    data;
    void       *udata;
};
</code>

kqueue works on a principle similar to epoll but supports more event types. It is mainly used on FreeBSD.

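The C10K problem and solutions

C10K refers to a service that supports ten thousand concurrent connections. Common solutions include a single thread with I/O multiplexing (select/poll/epoll/kqueue), edge triggering, AIO, thread pools, and frameworks such as Nginx, libevent, and Netty.

Mature solutions such as Nginx, libevent, and Netty are already widely used in high-concurrency scenarios.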
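The "single thread plus I/O multiplexing" approach can be sketched as one pass of an event loop. This is a deliberately small illustration, not how Nginx or libevent are implemented: a `socketpair` stands in for an accepted client connection, and the names `drain_and_echo` and `echo_demo` are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read a non-blocking fd until EAGAIN, echoing every chunk back.
 * Returns total bytes echoed, or -1 on error. Sketch only: partial
 * writes are treated as errors instead of being retried. */
static long drain_and_echo(int fd)
{
    char buf[256];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            if (write(fd, buf, (size_t)n) != n)
                return -1;
            total += n;
        } else if (n == 0) {
            break;                      /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                      /* buffer fully drained */
        } else if (errno != EINTR) {
            return -1;
        }
    }
    return total;
}

/* Hypothetical demo: sv[0] plays the client, sv[1] the server-side
 * connection handled by one epoll pass. Returns bytes echoed or -1. */
long echo_demo(const char *msg, size_t len, char *reply, size_t cap)
{
    int sv[2];
    int epfd = -1;
    int fl, n;
    long echoed = -1;
    struct epoll_event ev;
    struct epoll_event ready[8];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    epfd = epoll_create1(0);
    fl = fcntl(sv[1], F_GETFL, 0);
    if (epfd < 0 || fl < 0 || fcntl(sv[1], F_SETFL, fl | O_NONBLOCK) < 0)
        goto out;

    ev.events = EPOLLIN | EPOLLET;      /* edge-triggered, so drain fully */
    ev.data.fd = sv[1];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[1], &ev) < 0)
        goto out;

    if (write(sv[0], msg, len) != (ssize_t)len)   /* client sends a request */
        goto out;

    n = epoll_wait(epfd, ready, 8, 1000);         /* one loop iteration */
    for (int i = 0; i < n; i++)
        if (ready[i].events & EPOLLIN)
            echoed = drain_and_echo(ready[i].data.fd);

    /* client reads the echo; cap must be at least len for this to match */
    if (echoed > 0 && read(sv[0], reply, cap) != echoed)
        echoed = -1;
out:
    if (epfd >= 0)
        close(epfd);
    close(sv[0]);
    close(sv[1]);
    return echoed;
}
```

A real server would wrap the `epoll_wait` call in an endless loop, register the listening socket as well, and add each accepted connection with `EPOLL_CTL_ADD`; the single thread then services whichever connections are ready instead of blocking on any one of them.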

That concludes this article. Feel free to share and like it.
