Mastering epoll: Deep Dive into Linux I/O Multiplexing
This article thoroughly examines Linux's epoll mechanism, detailing its SLAB memory management, middle‑layer design, edge and level triggering, comparison with select/poll, and related advanced polling technologies such as /dev/poll and kqueue, while also discussing C10K/C10M challenges and practical solutions.
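Supplementary Notes on epoll
1. SLAB Memory Management
Characteristics of SLAB memory management
epitem/eventpoll objects are stored in contiguous memory, which avoids memory fragmentation
When an epitem/eventpoll object is freed, it is returned to an object pool for reuse, which reduces the cost of repeated creation and destruction
Memory is allocated as follows:
Source code where epoll creates its objects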
<code>// eventpoll.c
ep = kzalloc(sizeof(*ep), GFP_KERNEL);

// slab.h
static inline void *kzalloc(size_t size, gfp_t gfp)
{
    return kmalloc(size, gfp | __GFP_ZERO);
}

// eventpoll.c
/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;

/* Slab cache used to allocate "struct eppoll_entry" */
static struct kmem_cache *pwq_cache __read_mostly;
</code>epoll creates its objects through the SLAB allocator, which avoids fragmentation and reuses pooled objects to improve performance.
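The object-pool idea behind a slab cache can be illustrated in user space. The following is only a minimal sketch of the concept, not the kernel's SLAB implementation: `struct item`, `struct item_cache`, and the `cache_*` functions are hypothetical names introduced for this example. It keeps all objects in one contiguous array and recycles freed objects through a free list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size object standing in for struct epitem. */
struct item {
    int fd;
    struct item *next_free;   /* link used only while the object sits in the pool */
};

/* A tiny slab-like cache: one contiguous array plus a free list. */
struct item_cache {
    struct item *slab;        /* contiguous backing storage */
    struct item *free_list;   /* objects available for reuse */
    size_t capacity;
};

static int cache_init(struct item_cache *c, size_t capacity)
{
    c->slab = calloc(capacity, sizeof(*c->slab));
    if (!c->slab)
        return -1;
    c->capacity = capacity;
    c->free_list = NULL;
    /* Thread every slot onto the free list. */
    for (size_t i = 0; i < capacity; i++) {
        c->slab[i].next_free = c->free_list;
        c->free_list = &c->slab[i];
    }
    return 0;
}

/* Pop an object from the pool; returns NULL when the slab is exhausted. */
static struct item *cache_alloc(struct item_cache *c)
{
    struct item *it = c->free_list;
    if (it)
        c->free_list = it->next_free;
    return it;
}

/* Return an object to the pool instead of handing it back to the heap. */
static void cache_free(struct item_cache *c, struct item *it)
{
    assert(it >= c->slab && it < c->slab + c->capacity);
    it->next_free = c->free_list;
    c->free_list = it;
}

/* Release the backing storage once no object is in use any more. */
static void cache_destroy(struct item_cache *c)
{
    free(c->slab);
    c->slab = NULL;
    c->free_list = NULL;
    c->capacity = 0;
}
```

Allocation and release are both constant-time pointer swaps, and a freed object is the next one handed out, which is the reuse behavior the list above describes.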
2. epoll Design Philosophy
epoll is designed around an intermediate layer
Partial source code of the eventpoll and epitem structures
<code>struct eventpoll {
    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;
    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;
    /* List of ready file descriptors */
    struct list_head rdllist;
    /* Lock which protects rdllist and ovflist */
    rwlock_t lock;
    /* RB tree root used to store monitored fd structs */
    struct rb_root_cached rbr;
    /* Single linked list of epitem that happened while transferring ready events */
    struct epitem *ovflist;
};

struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };
    /* List header used to link this structure to the eventpoll ready list */
    struct list_head rdllink;
    /* The file descriptor information this item refers to */
    struct epoll_filefd ffd;
    /* The "container" of this item */
    struct eventpoll *ep;
    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;
    /* The structure that describes the interested events and the source fd */
    struct epoll_event event;
};
</code>epoll uses an intermediate layer that binds each socket to an epitem. Monitored descriptors are stored in a red-black tree (rbr), ready items are linked on the ready list (rdllist), and events that arrive while ready events are being transferred go onto a singly linked list (ovflist).
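From user space, this intermediate layer is driven entirely through three system calls. The sketch below is a minimal illustration on Linux using a pipe instead of a socket; the function name `epoll_roundtrip` is an assumption made for this example, and the comments map each call to the kernel structures shown above only loosely.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers the read end of a pipe with a new epoll instance, makes it
 * readable, and checks that epoll_wait reports it.
 * Returns 0 on success, -1 on any failure.
 */
static int epoll_roundtrip(void)
{
    int pipefd[2];
    int epfd;
    int rc = -1;
    struct epoll_event ev, ready;

    if (pipe(pipefd) < 0)
        return -1;

    /* The kernel allocates one struct eventpoll behind this descriptor. */
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    /* EPOLL_CTL_ADD creates an epitem for the fd and inserts it into rbr. */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out_epoll;

    /* Making the fd readable queues its epitem on the ready list. */
    if (write(pipefd[1], "x", 1) != 1)
        goto out_epoll;

    /* epoll_wait copies ready events to user space; 1000 ms timeout. */
    if (epoll_wait(epfd, &ready, 1, 1000) == 1 &&
        (ready.events & EPOLLIN) && ready.data.fd == pipefd[0])
        rc = 0;

out_epoll:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

The application never touches the monitored descriptor's wait queue directly; it only talks to the epoll descriptor, which is what makes the design an intermediate layer.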
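3. Other Technical Points of epoll
Edge and level triggering
Edge-triggered: an event fires when new data arrives in the socket buffer. Level-triggered: the descriptor keeps being reported as readable for as long as the buffer is non-empty.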
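<code>// Level-triggered is the default; EPOLLET selects edge-triggered mode,
// and EPOLLONESHOT disarms the item after a single event
list_for_each_entry_safe(epi, tmp, head, rdllink) {
    if (esed->res >= esed->maxevents)
        break;
    // Handle the item's wakeup source (power management)
    ws = ep_wakeup_source(epi);
    if (ws) {
        if (ws->active)
            __pm_stay_awake(ep->ws);
        __pm_relax(ws);
    }
    // Remove the epitem from the ready list
    list_del_init(&epi->rdllink);
    // Poll the file again to collect the events that are ready right now
    revents = ep_item_poll(epi, &pt, 1);
    if (!revents)
        continue;
    // Copy the ready event to user space
    if (__put_user(revents, &uevent->events) ||
        __put_user(epi->event.data, &uevent->data)) {
        list_add(&epi->rdllink, head);
        ep_pm_stay_awake(epi);
        if (!esed->res)
            esed->res = -EFAULT;
        return 0;
    }
    esed->res++;
    uevent++;
    if (epi->event.events & EPOLLONESHOT)
        epi->event.events &= EP_PRIVATE_BITS;
    else if (!(epi->event.events & EPOLLET)) {
        // Level-triggered: put the item back so the next epoll_wait re-checks it
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);
    }
}

#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
</code>In level-triggered mode the item is put back on the ready list, so every epoll_wait call re-checks it and reports it again while unread data remains. In edge-triggered mode the item is not re-queued, so it is reported only when new data arrives.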
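The difference can be observed from user space. The sketch below is a small experiment on Linux rather than anything taken from the kernel source: the helper name `count_notifications` is an assumption for this example. It writes one byte into a pipe, never reads it, and counts how many consecutive epoll_wait calls report the descriptor.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Writes one byte into a pipe, never reads it, and counts how many of
 * `rounds` consecutive epoll_wait calls report the read end as ready.
 * `extra_flags` is 0 for level-triggered, EPOLLET for edge-triggered,
 * or EPOLLONESHOT for one-shot. Returns the count, or -1 on failure.
 */
static int count_notifications(unsigned int extra_flags, int rounds)
{
    int pipefd[2];
    int epfd;
    int count = -1;
    struct epoll_event ev, ready;

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out_epoll;
    if (write(pipefd[1], "x", 1) != 1)
        goto out_epoll;

    count = 0;
    for (int i = 0; i < rounds; i++) {
        /* Timeout 0: report what is ready right now without blocking. */
        int n = epoll_wait(epfd, &ready, 1, 0);
        if (n < 0) {
            count = -1;
            break;
        }
        count += n;
    }

out_epoll:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return count;
}
```

With three rounds, the level-triggered registration is reported on every call because the byte is still unread, while the edge-triggered and one-shot registrations are reported only once. This is why edge-triggered code is normally paired with non-blocking descriptors that are read until EAGAIN.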
Advanced Polling Techniques
/dev/poll
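<code>#include <sys/devpoll.h>  // Solaris only; declares struct dvpoll and DP_POLL

struct dvpoll {
    struct pollfd *dp_fds;  // array that receives the ready pollfd entries
    nfds_t dp_nfds;         // number of entries dp_fds can hold
    int dp_timeout;         // timeout in milliseconds
};

wfd = open("/dev/poll", O_RDWR);
// pollfd is an array of MAX_SIZE struct pollfd entries; write() takes a byte count
write(wfd, pollfd, sizeof(struct pollfd) * MAX_SIZE);
// dvpoll is a filled-in struct dvpoll; DP_POLL waits for events
ioctl(wfd, DP_POLL, &dvpoll);
</code>/dev/poll provides scalable polling on Solaris: the set of file descriptors is registered once in advance, and the program then waits for events in a loop.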
kqueue
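<code>// Returns a new kqueue descriptor
int kqueue(void);

// Registers changes and/or retrieves pending events
int kevent(int kq,
           const struct kevent *changelist, int nchanges,
           struct kevent *eventlist, int nevents,
           const struct timespec *timeout);

// Macro that fills in a kevent structure
void EV_SET(struct kevent *kev, uintptr_t ident, short filter,
            u_short flags, u_int fflags, intptr_t data, void *udata);

// The kevent structure
struct kevent {
    uintptr_t ident;   // identifier for this event (for example, a file descriptor)
    short filter;      // filter for the event (for example, EVFILT_READ)
    u_short flags;     // action flags (for example, EV_ADD)
    u_int fflags;      // filter-specific flags
    intptr_t data;     // filter-specific data
    void *udata;       // opaque user data passed through unchanged
};
</code>kqueue works on a principle similar to epoll but supports more event types. It is mainly used on FreeBSD and other BSD-derived systems.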
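The C10K Problem and Solutions
C10K refers to serving ten thousand concurrent connections. Common solutions include a single thread combined with I/O multiplexing (select/poll/epoll/kqueue), edge triggering, AIO, and thread pools.
Mature solutions such as Nginx, libevent, and Netty are already widely used in high-concurrency scenarios.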
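The "single thread plus I/O multiplexing with edge triggering" approach can be sketched as a small echo loop on Linux. This is only an illustration under stated assumptions, not how Nginx or libevent are implemented: the helper names `set_nonblocking`, `watch_connection`, `echo_until_drained`, and `serve_ready` are hypothetical, and a real server would buffer partial writes instead of dropping the connection.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; needed for edge-triggered use. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Register a connection for edge-triggered read events. */
static int watch_connection(int epfd, int fd)
{
    struct epoll_event ev;

    if (set_nonblocking(fd) < 0)
        return -1;
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Edge-triggered handlers must read until EAGAIN, because the same data
 * will not be reported again. Echoes everything back to the peer.
 * Returns the number of bytes echoed, or -1 on error.
 */
static long echo_until_drained(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            if (write(fd, buf, (size_t)n) != n)
                return -1;          /* sketch: no partial-write buffering */
            total += n;
        } else if (n == 0) {
            return total;           /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;           /* buffer drained */
        } else if (errno != EINTR) {
            return -1;
        }
    }
}

/*
 * One pass of a single-threaded event loop: waits up to timeout_ms and
 * serves every ready connection. Returns the number of ready connections,
 * or -1 if epoll_wait fails.
 */
static int serve_ready(int epfd, int timeout_ms)
{
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, timeout_ms);

    for (int i = 0; i < n; i++)
        if (echo_until_drained(events[i].data.fd) < 0)
            close(events[i].data.fd);   /* closing also removes it from epoll */
    return n;
}
```

A server would call `watch_connection` for each accepted socket and then call `serve_ready` in a loop. One thread handles every connection, and its cost per pass depends on the number of ready descriptors rather than the number of registered ones.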
That concludes this article. Feel free to share it and give it a like.
Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.