Unlocking Java NIO: How Select, Poll, and Epoll Revolutionize I/O Multiplexing
This article explains the evolution of I/O multiplexing in Java, covering the birth of multiplexing, the introduction of NIO with Selector, and detailed comparisons of select, poll, and epoll mechanisms, including their APIs, internal workings, and performance considerations for high‑concurrency network programming.
Hello everyone, I am Sanyou~~
1. Birth of Multiplexing
Non‑blocking I/O can handle all sockets with a single thread, but the cost is that the thread must frequently poll each socket for data, leading to many empty polls that waste performance.
We want a component that can monitor multiple sockets simultaneously and notify the process which sockets are "ready" when data is prepared, so the process only reads or writes on ready sockets.
Java introduced NIO in JDK 1.4 and provided the Selector component to achieve this functionality.
2. NIO
Before introducing NIO code, a few points need clarification.
The term "ready" is ambiguous because different sockets have different readiness criteria. For a listening socket, an incoming client connection makes it ready; it does not need the read/write handling a connected socket does. For a connected socket, readiness means data is available for reading or buffer space is available for writing.
Thus, when we let Selector monitor multiple sockets, we must tell the Selector which sockets and which events we are interested in. This action is called registration.
Now let's look at the code.
<code>import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.Set;

public class NIOServer {

    static Selector selector;

    public static void main(String[] args) {
        try {
            // Obtain the selector multiplexer
            selector = Selector.open();
            ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
            // The accept operation on the listening socket will not block
            serverSocketChannel.configureBlocking(false);
            serverSocketChannel.socket().bind(new InetSocketAddress(8099));
            // Register the listening socket with the multiplexer, interested in OP_ACCEPT events
            serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);
            while (true) {
                // This call blocks until at least one registered channel is ready
                selector.select();
                // Get all ready events; each event is wrapped in a SelectionKey
                Set<SelectionKey> selectionKeys = selector.selectedKeys();
                Iterator<SelectionKey> iterator = selectionKeys.iterator();
                while (iterator.hasNext()) {
                    SelectionKey key = iterator.next();
                    // Remove the key so it is not processed again on the next select()
                    iterator.remove();
                    if (key.isAcceptable()) {
                        handleAccept(key);
                    } else if (key.isReadable()) {
                        handleRead(key);
                    } else if (key.isWritable()) {
                        // Send data
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Business logic for handling "read" events
    private static void handleRead(SelectionKey key) {
        SocketChannel socketChannel = (SocketChannel) key.channel();
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        try {
            int n = socketChannel.read(buffer);
            if (n == -1) {
                // The client closed the connection; cancel the key and close the channel
                key.cancel();
                socketChannel.close();
                return;
            }
            System.out.println("From Client:" + new String(buffer.array(), 0, n));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Business logic for handling "accept" events
    private static void handleAccept(SelectionKey key) {
        ServerSocketChannel serverSocketChannel = (ServerSocketChannel) key.channel();
        try {
            // The key is acceptable, so accept() returns a connection without blocking
            SocketChannel socketChannel = serverSocketChannel.accept();
            // Set the connected socket to non-blocking
            socketChannel.configureBlocking(false);
            socketChannel.write(ByteBuffer.wrap("Hello Client, I am Server!".getBytes()));
            // Register the connected socket for "read" events
            socketChannel.register(selector, SelectionKey.OP_READ);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
</code>We first call Selector.open() to obtain the multiplexer object; then we create a listening socket on the server, set it to non‑blocking, and finally register it with the selector, indicating that we want to be notified when an OP_ACCEPT event occurs.
In the while loop we call selector.select(). The process blocks on this call until some registered socket has a ready event, then returns. If you set a breakpoint on the line after selector.select() and run in debug mode without any client connecting, the breakpoint will never be hit.
When select() returns, it means one or more sockets are ready. We use a Set<SelectionKey> to store all events; SelectionKey encapsulates the ready events, and we iterate over each event, handling them according to the event type.
If an OP_READ event is ready, we allocate a buffer and read data from kernel space into it. If an OP_ACCEPT event is ready, we call the listening socket's accept() method to obtain a connected socket; this does not block because the listening socket was set to non‑blocking. The connected socket must also be set to non‑blocking, and then we register it with the selector for OP_READ events.
The above diagram explains the Java multiplexing code.
3. select
select blocks within a timeout until the sockets we are interested in become readable, writable, or have an exception. When it returns, the process receives an integer indicating the number of ready descriptors, and must iterate over the descriptor set to determine which sockets are ready.
In fact, "read ready", "write ready", or "exception" events involve many details; here we only consider the literal meaning. For more details, refer to "Unix Network Programming, Volume 1".
The user process receives only this count, for two reasons: the readiness check runs entirely in kernel mode, and the kernel reports the per‑descriptor results by writing bits back into the descriptor sets before control returns to user space. The process must therefore examine each descriptor to see whether its corresponding bit is set to 1, indicating readiness.
The three parameters readfds , writefds , and exceptfds are pointers that tell the kernel which descriptors we are interested in; after the call, the kernel writes the readiness bits back into these same structures (they are value‑result parameters). We use <code>FD_ISSET(int fd, fd_set *fdset)</code> to check each descriptor.
3.1. Summary
select (including poll ) is blocking; the process blocks on select rather than on the actual I/O system call. The model diagram is shown below.
We use a single user thread to handle all sockets, avoiding the wasted polling of non‑blocking I/O, at the cost of one blocking select system call plus N read/write system calls for the ready file descriptors.
4. poll
poll is the successor of select . First, look at its function prototype:
<code>int poll(struct pollfd *fds, nfds_t nfds, int timeout);</code>The first parameter is an array of pollfd structures, defined as:
<code>struct pollfd {
int fd; /* file descriptor */
short events; /* events to look for */
short revents; /* events returned */
};</code>The events field uses a bitmask to specify the types of events we are interested in. The definitions (from /usr/include/bits/poll.h ) are:
<code>#define POLLIN 0x001 /* There is data to read. */
#define POLLPRI 0x002 /* There is urgent data to read. */
#define POLLRDNORM 0x040 /* Normal data may be read. */
#define POLLRDBAND 0x080 /* Priority data may be read. */
#define POLLOUT 0x004 /* Writing now will not block. */
#define POLLWRNORM 0x100 /* Writing now will not block. */
#define POLLWRBAND 0x200 /* Priority data may be written. */
#define POLLERR 0x008 /* Error condition. */
#define POLLHUP 0x010 /* Hung up. */
#define POLLNVAL 0x020 /* Invalid polling request. */
</code>Unlike select , poll stores the result of each poll in the revents field, so we do not need to reset the interest set before each call.
The second parameter nfds specifies the number of elements in the fds array, i.e., how many descriptors we want to monitor. Because fds is a caller‑allocated array rather than a fixed‑size bitmap, poll is not bound by select's 1024‑descriptor (FD_SETSIZE) limit.
The timeout parameter sets the timeout in milliseconds; a negative value blocks indefinitely, and 0 returns immediately.
5. epoll
epoll is the most powerful multiplexing model among the three. It provides three functions: epoll_create , epoll_ctl , and epoll_wait .
5.1. Creating an epoll instance
<code>int epoll_create(int size); // size is ignored after Linux 2.6.8, must be >0 for compatibility
int epoll_create1(int flags); // flags can include EPOLL_CLOEXEC</code>epoll_create() creates an epoll instance and returns a descriptor that represents the instance. Internally, the epoll instance maintains two important structures: a tree of file descriptors to monitor and a list of ready file descriptors.
5.2. Registering events with epoll_ctl
<code>int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);</code>op can be EPOLL_CTL_ADD (register), EPOLL_CTL_DEL (remove), or EPOLL_CTL_MOD (modify). The event field contains a bitmask of events (e.g., EPOLLIN, EPOLLOUT) and a user data union.
5.3. Waiting for events with epoll_wait
<code>int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);</code>The call blocks until at least one monitored descriptor becomes ready, then fills the events array with the ready descriptors and their event masks.
5.4. Edge‑triggered vs. Level‑triggered
Adding the EPOLLET flag makes epoll edge‑triggered. In edge‑triggered mode, the kernel notifies the process only when the state changes (e.g., new data arrives), whereas level‑triggered mode repeatedly notifies as long as the condition holds.
5.5. Internals of epoll
Only file types that implement a poll method in their file_operations structure can be monitored by epoll. Sockets implement sock_poll() , which is why they can be used with epoll.
When epoll_create1() is called, the kernel allocates a struct eventpoll object, creates an anonymous file [eventpoll] , stores a pointer to the eventpoll object in the file's private_data , and returns a file descriptor that refers to this anonymous file.
Adding a socket to an epoll instance involves creating an epitem structure, initializing it, setting up a poll_table with the callback ep_ptable_queue_proc , and inserting the epitem into the epoll instance's red‑black tree. The callback ultimately registers ep_poll_callback in the socket's wait queue ( sk_wq ).
When data arrives, the network card triggers a hardware interrupt, which quickly hands the remaining work to a kernel softirq thread ( ksoftirqd ). After the packet is delivered to the socket's receive queue, the kernel walks the socket's wait queue ( sk_wq ) and invokes the entries registered there, including the one containing ep_poll_callback .
ep_poll_callback runs in the context of the kernel thread, adds the corresponding epitem to the epoll instance's ready list ( rdllist ), and wakes up any process waiting on the epoll instance's wait queue ( wq ) via default_wake_function .
Finally, epoll_wait checks the ready list; if it is empty, it creates a wait‑queue entry for the current process, adds it to the epoll instance's wq , and puts the process to sleep. When ep_poll_callback wakes the process, epoll_wait returns the list of ready descriptors to user space.
Recursive monitoring is supported: an epoll instance can monitor another epoll instance. The inner epoll instance's poll_wait queue holds the outer epoll instance, so when the inner instance becomes ready, it notifies the outer instance, which then reports the readiness to the user process.
In summary, epoll creates an eventpoll object with a ready list, a wait queue, and a red‑black tree. epoll_ctl registers sockets (or other epoll instances) by inserting epitem nodes into the tree and linking callbacks into the socket's wait queue. epoll_wait either returns ready descriptors immediately or blocks the process on the epoll wait queue. When a socket becomes ready, the kernel invokes ep_poll_callback , which moves the corresponding epitem to the ready list and wakes the waiting process via default_wake_function . This design avoids scanning all descriptors on each poll and scales efficiently to large numbers of connections.
Sanyou's Java Diary
Passionate about technology, though not great at solving problems; eager to share, never tire of learning!