
Understanding InnoDB IO Subsystem: Threads, Asynchronous Requests, and Prefetch Mechanisms

This article explains the architecture and code flow of InnoDB's IO subsystem, covering read/write threads, asynchronous AIO handling, concurrency control, file prefetch strategies, and log write padding, with detailed examples of the underlying C++ structures and functions.


Preface

InnoDB separates page disk operations into read and write actions. For reads, a block is allocated before loading a page, and change‑buffer entries are checked to apply pending modifications. Reads occur as synchronous normal reads or asynchronous prefetch reads, while writes can be batch or single‑page writes, protected by the double‑write buffer (synchronous to the buffer, asynchronous to data files).

Background Threads

IO READ threads – count set by innodb_read_io_threads; handle asynchronous reads from the data files using the os_aio_read_array queue (threads × 256 slots on Linux).

IO WRITE threads – count set by innodb_write_io_threads; handle asynchronous writes via os_aio_write_array (same slot calculation as for reads).

LOG thread – handles asynchronous redo-log writes; its queue is a single segment of 256 slots, used for example when checkpoint information is flushed.

IBUF thread – processes change-buffer pages; uses os_aio_ibuf_array (one segment, 256 slots).

All synchronous writes are performed by user threads or other background threads; the above threads only handle asynchronous operations.

Issuing Requests

The entry point is os_aio_func. For synchronous requests (OS_AIO_SYNC) the calling thread directly invokes os_file_read_func or os_file_write_func. For asynchronous requests the user thread reserves a slot in the appropriate queue (os_aio_array_reserve_slot) and fills it with the operation details.

Example slot preparation (C++ code):

/* Pick a segment (and therefore an I/O thread): shifting the byte
   offset by UNIV_PAGE_SIZE_SHIFT + 6 yields the 64-page extent number,
   so consecutive extents are spread round-robin across the segments. */
local_seg = (offset >> (UNIV_PAGE_SIZE_SHIFT + 6)) % array->n_segments;

slot->is_reserved = true;            /* mark the slot as taken */
slot->reservation_time = ut_time();  /* lets monitoring spot stuck I/O */
slot->message1 = message1;           /* completion context for the caller */
slot->message2 = message2;
slot->file = file;
slot->name = name;
slot->len = len;                     /* length in bytes */
slot->type = type;                   /* OS_FILE_READ or OS_FILE_WRITE */
slot->buf = static_cast<byte*>(buf);
slot->offset = offset;               /* byte offset within the file */
slot->io_already_done = false;

For native AIO the code prepares an iocb structure:

aio_offset = (off_t) offset;
/* Guard against truncation if off_t is narrower than os_offset_t. */
ut_a(sizeof(aio_offset) >= sizeof(offset)
     || ((os_offset_t) aio_offset) == offset);

iocb = &slot->control;
if (type == OS_FILE_READ) {
        io_prep_pread(iocb, file, buf, len, aio_offset);
} else {
        ut_a(type == OS_FILE_WRITE);
        io_prep_pwrite(iocb, file, buf, len, aio_offset);
}
/* Stash the slot pointer so the completion handler can find the slot
   again in the io_event returned by io_getevents(). */
iocb->data = (void*) slot;

If native AIO is disabled, the request stays in its slot and the corresponding simulated AIO handler thread is woken (os_aio_simulated_wake_handler_thread).

Processing Asynchronous AIO Requests

The I/O thread entry point is io_handler_thread → fil_aio_wait . For native AIO it calls os_aio_linux_handle and repeatedly polls with io_getevents (500 ms timeout) to collect completed tasks, then frees the slot via os_aio_array_free_slot . For simulated AIO it uses os_aio_simulated_handle , which may delay processing to improve request merging, select the oldest pending slot, group consecutive I/O operations, allocate a buffer, and issue a combined read or write.

After an I/O completes, fil_node_complete_io decrements node->n_pending . For data‑file writes the node is added to fil_system->unflushed_spaces unless O_DIRECT_NO_FSYNC is set. Further processing includes buf_page_io_complete (corruption checks, change‑buffer merge, LRU free‑list updates) and log_io_complete (flushing redo logs and updating checkpoint information).

Concurrency Control

On Linux, pwrite/pread allow concurrent file I/O without locks; on Windows explicit locking is required. When a tablespace is being extended, fil_node->being_extended is set to prevent concurrent extensions, drops, or renames. Deleting a table checks for pending operations via fil_check_pending_operations and may set fil_space_t::stop_new_ops . Similar checks are performed for truncate and rename operations, setting flags such as is_being_truncated or stop_ios to block further I/O.

File Prefetch

Prefetch reduces random I/O on spinning disks. InnoDB implements three strategies:

Random Prefetch

Entry: buf_read_ahead_random. When the number of recently accessed pages within a 64-page extent exceeds BUF_READ_AHEAD_RANDOM_THRESHOLD (default 13), the remaining pages in that extent are read asynchronously. Controlled by innodb_random_read_ahead.

Linear Prefetch

Entry: buf_read_ahead_linear. When sequential page accesses exceed innodb_read_ahead_threshold (default 56), the next extent of 64 pages is read ahead in order.

Logical Prefetch

Introduced by Facebook to handle fragmented tables where physical pages are not contiguous. The engine scans the clustered index, collects leaf‑node page numbers, and asynchronously reads a batch of pages based on logical order. The implementation modifies InnoDB to submit multiple AIO requests in a single io_submit call (see commits 2d613294..., 9f52bfd2..., 64b68e07...).

Log Write Padding

Modern disks use block sizes larger than 512 bytes (commonly 4096). To avoid the read-modify-write penalty, MySQL 5.7 introduced innodb_log_write_ahead_size, which aligns redo-log writes to that size: the tail of a partial write is padded with zeros up to the next boundary before the write is issued (see log_write_up_to).

Tip: the read-on-write problem occurs when a write covers less than a full block, forcing the OS to read the block, modify it, and write it back; block-aligned writes eliminate this overhead.

Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies.