InnoDB Buffer Pool Architecture, Data Structures, and Page Lifecycle
This article provides a comprehensive overview of InnoDB's buffer pool: its role as a data cache, the underlying data structures (instances, chunks, and blocks), the page lifecycle from allocation to flushing, and the limitations of the default page-cleaner implementation along with Percona's enhancements.
1. Overview
The buffer pool is InnoDB's data cache, storing data pages, index pages, undo pages, insert buffer pages, adaptive hash indexes, data dictionary entries, and lock information. Most pages are data pages (including index pages). InnoDB also has a log buffer for redo logs.
The diagram below shows the position of the InnoDB buffer pool within MySQL.
All read/write operations on data pages go through the buffer pool. Writes first place data and logs into the buffer pool and log buffer, then background threads flush them to disk. Reads fetch pages from disk into the buffer pool if they are not already present, possibly evicting older pages. Redo logs guarantee transaction durability; the buffer pool exists solely to improve I/O efficiency. For optimal performance, the buffer pool should be as large as possible while leaving enough memory for MySQL's normal operation.
The buffer pool code resides mainly in the following files:
folder path: storage/innobase/buf
file list:
buf0buddy.cc (880 lines): Binary buddy allocator for compressed pages
buf0buf.cc (7050 lines): The database buffer buf_pool
buf0checksum.cc (159 lines): Buffer pool checksum functions, also linked from /extra/innochecksum.cc
buf0dblwr.cc (1289 lines): Doublewrite buffer module
buf0dump.cc (810 lines): Implements a buffer pool dump/load.
buf0flu.cc (3838 lines): The database buffer buf_pool flush algorithm
buf0lru.cc (2786 lines): The database buffer replacement algorithm
buf0rea.cc (902 lines): The database buffer read
2. Data Structures
Logically, the InnoDB buffer pool is organized into three levels: instance, chunk, and block, corresponding to the structs buf_pool_t , buf_chunk_t , and buf_block_t .
buf_pool_t
The number of instances is controlled by innodb_buffer_pool_instances . Each instance receives innodb_buffer_pool_size / innodb_buffer_pool_instances bytes at startup and releases it on shutdown. A page’s instance is determined by (page_id >> 6) % instance_count , so 64 consecutive pages map to the same instance.
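The instance-selection rule above can be sketched as a small function. The name buf_pool_index and the parameter names here are illustrative, not the actual InnoDB symbols:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of instance selection: the low 6 bits of the page number are
// discarded, so runs of 64 consecutive pages land in the same instance,
// and the remainder spreads those runs across instances round-robin.
inline uint32_t buf_pool_index(uint32_t page_no, uint32_t n_instances) {
    return (page_no >> 6) % n_instances;
}
```

With 8 instances, pages 0–63 map to instance 0, pages 64–127 to instance 1, and page 512 wraps back to instance 0.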
LRU List
The LRU list evicts rarely used pages. It is split into an old and a new part; innodb_old_blocks_pct controls the size of the old part, and innodb_old_blocks_time determines how long a newly moved page stays in the old part before being promoted. This design prevents freshly accessed pages from monopolizing the buffer pool during low‑frequency full‑table scans. Certain pages (adaptive hash, lock info, etc.) are allocated from the free list but are not placed on the LRU list.
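The promotion rule governed by innodb_old_blocks_time can be sketched as follows. This is a minimal model assuming each page records the time of its first access while in the old sublist; the struct and field names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Minimal model of the old-sublist promotion rule. A page newly read
// into the buffer pool enters the old sublist; it is promoted to the
// new sublist only if it is touched again after old_blocks_time_ms
// has elapsed, which keeps one-shot scan pages from flooding the
// new sublist.
struct PageStub {
    bool in_old_sublist;
    uint64_t first_access_ms;  // time of first access in the old sublist
};

inline bool should_promote(const PageStub& p, uint64_t now_ms,
                           uint64_t old_blocks_time_ms) {
    return p.in_old_sublist &&
           (now_ms - p.first_access_ms) >= old_blocks_time_ms;
}
```

A full-table scan touches each page once in quick succession, so its pages fail this check and are evicted from the old sublist without ever displacing hot pages.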
Hazard Point
A hazard pointer marks the current position during a backward traversal of a list (such as the LRU or Flush List). If another thread removes the page the pointer references while the list mutex is temporarily released, the pointer is adjusted, so the scan can resume safely without restarting from the tail.
Free List
The free list holds free pages for allocation. If the list is empty when a page is requested, dirty pages must be flushed to create new free pages.
Flush List
Contains dirty pages that have been modified but not yet flushed to disk. All pages in the Flush List are also present in the LRU list, but not vice‑versa. Each page records the oldest modification LSN ( oldest_modification ). The Flush List is ordered by this LSN, with the tail holding the oldest dirty pages, which are flushed first. Modifications to the Flush List are protected by flush_list_mutex .
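The ordering invariant of the Flush List can be illustrated with a small sketch. This is an assumption-laden model (a plain std::list standing in for the real structure, with no flush_list_mutex): pages are added at the head in order of first modification, so oldest_modification is non-increasing from head to tail, and flushing starts from the tail:

```cpp
#include <cassert>
#include <cstdint>
#include <list>

// Illustrative model of the Flush List. Real InnoDB code protects
// these operations with flush_list_mutex.
struct DirtyPage { uint64_t oldest_modification; };

// A newly dirtied page carries the current (highest) LSN, so pushing
// it at the head preserves the head-to-tail descending LSN order.
inline void flush_list_add(std::list<DirtyPage>& fl, uint64_t lsn) {
    fl.push_front(DirtyPage{lsn});
}

// Flushing proceeds from the tail, i.e. the oldest modification first.
inline uint64_t flush_list_oldest(const std::list<DirtyPage>& fl) {
    return fl.back().oldest_modification;
}
```

Flushing the tail first is what lets InnoDB advance the redo-log checkpoint: once the page with the smallest oldest_modification is on disk, all redo records below that LSN become reclaimable.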
page_hash
Provides O(1) lookup of pages by page ID. It is a simple hash table with table_size buckets, each protected by a read‑write lock. The hash value is computed as (key ^ UT_HASH_RANDOM_MASK2) % table_size , with collisions resolved by chaining.
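The bucket computation described above can be written out directly. The mask value below is an assumption for illustration (InnoDB defines the real constant as UT_HASH_RANDOM_MASK2 in ut0rnd.h):

```cpp
#include <cassert>
#include <cstdint>

// Assumed stand-in for UT_HASH_RANDOM_MASK2; the XOR decorrelates
// sequential page IDs before the modulo picks a bucket.
constexpr uint64_t kHashRandomMask2 = 1653893711ULL;

inline uint64_t page_hash_bucket(uint64_t key, uint64_t table_size) {
    return (key ^ kHashRandomMask2) % table_size;
}
```

Two pages whose keys collide under this function land in the same bucket and are chained together; the per-bucket read-write lock then serializes access to that chain only, not the whole table.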
buf_chunk_t and buf_block_t
Chunk size is defined by innodb_buffer_pool_chunk_size . Introducing chunks allows online resizing of the buffer pool in chunk‑sized increments.
buf_block_t begins with a buf_page_t header (metadata) followed by a frame pointer to the actual data page. The state field of buf_page_t can be:
BUF_BLOCK_NOT_USED – page is on the free list and unused.
BUF_BLOCK_READY_FOR_USE – page has been taken from the free list and is ready for use.
BUF_BLOCK_FILE_PAGE – page is actively used and resides in the LRU list.
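The layout and states described above can be sketched as stub structs. Field and type names here mirror the text rather than the exact InnoDB definitions:

```cpp
#include <cassert>
#include <cstdint>

// Subset of the page states listed above.
enum buf_page_state {
    BUF_BLOCK_NOT_USED,       // on the free list, unused
    BUF_BLOCK_READY_FOR_USE,  // taken off the free list
    BUF_BLOCK_FILE_PAGE       // actively used, on the LRU list
};

struct buf_page_stub {             // stands in for buf_page_t (metadata)
    buf_page_state state;
    uint64_t oldest_modification;  // 0 = not dirty
};

struct buf_block_stub {            // stands in for buf_block_t
    buf_page_stub page;            // buf_page_t header comes first
    unsigned char* frame;          // points at the actual data page
};
```

Because buf_page_t is the first member, a buf_block_t pointer can be treated as a buf_page_t pointer, which is how InnoDB passes blocks to functions that only need the metadata.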
Memory allocation logic can be found in the function buf_pool_init (file buf0buf.cc ).
3. Lifecycle of a Buffer Pool Page
During instance startup, the three‑level structure (instance → chunk → block) is allocated and all pages are placed on the free list.
Page reads are performed via Mini‑transactions, starting with buf_page_get_gen . The function first looks up the page in page_hash . If missing, the page is read from disk (step 3). If the page resides in the LRU old part, it is moved to the new part. The page is then locked according to the requested latch mode and returned.
If the page is not in the buffer pool, buf_read_page reads it from disk. buf_read_page_low calls buf_page_init_for_read to allocate a buffer block, then uses fil_io to read the data. If the read fails it is retried; random read-ahead ( buf_read_ahead_random ) may additionally fetch a batch of nearby pages to reduce subsequent I/O.
buf_page_init_for_read invokes buf_LRU_get_free_block (holding buf_pool->mutex ) to obtain a free page from the free list; if unavailable, it scans the LRU list for a flushed clean page to recycle. If still unsuccessful, buf_flush_single_page_from_LRU forces a flush of the oldest dirty page, which can be costly. To avoid duplicate reads, the thread acquiring a free block performs the following steps:
1. Lock buf_pool->mutex and the page-hash X lock.
2. Re-check page_hash for the page (another thread may have loaded it in the meantime).
3. Lock the block's mutex and insert the block into page_hash .
4. Set the I/O fix to BUF_IO_READ .
5. Release the hash lock.
6. Insert the block into the LRU old part.
7. Acquire the block's X lock and wait for the I/O handler thread to complete the read (the handler clears the I/O fix via buf_page_io_complete ).
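The re-check in step 2 is a double-checked lookup: the first page_hash probe happened without the lock, so a second probe under the lock is needed before inserting. A single-threaded sketch of the pattern, with std::unordered_map standing in for page_hash and illustrative names (real code holds buf_pool->mutex and the page-hash X lock at this point):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using PageHash = std::unordered_map<uint64_t, int /* block id */>;

// Re-check under the lock: if another thread already inserted the page
// between our first (unlocked) miss and acquiring the lock, reuse its
// block instead of issuing a second read of the same page.
inline int init_for_read(PageHash& page_hash, uint64_t page_id,
                         int new_block) {
    auto it = page_hash.find(page_id);
    if (it != page_hash.end()) {
        return it->second;  // lost the race: page is already loaded
    }
    page_hash.emplace(page_id, new_block);
    return new_block;       // won the race: our block will receive the I/O
}
```

The second call with the same page ID returns the first block, which is exactly how duplicate disk reads are avoided.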
When a page is modified, it becomes a dirty page. The first modification adds the page to the Flush List; subsequent modifications do not re‑insert it. The field oldest_modification records the LSN of the first change; a value of 0 indicates the page has not yet been marked dirty.
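The first-modification rule above can be sketched as a small helper. The struct and function names are illustrative; real InnoDB code also inserts the page into the Flush List under flush_list_mutex:

```cpp
#include <cassert>
#include <cstdint>

struct Page {
    uint64_t oldest_modification = 0;  // 0 means clean
};

// Returns true if the page was newly dirtied (i.e. it would be added
// to the Flush List); subsequent modifications keep the original LSN.
inline bool mark_dirty(Page& p, uint64_t mod_lsn) {
    if (p.oldest_modification != 0) {
        return false;  // already dirty: do not re-insert, keep first LSN
    }
    p.oldest_modification = mod_lsn;
    return true;
}
```

Keeping the first LSN (rather than the latest) is what preserves the Flush List's tail-is-oldest ordering, since a page's position is fixed when it first becomes dirty.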
Dirty pages are flushed by page‑cleaner threads. The coordinator thread wakes worker threads based on the recommended number of pages to flush (computed by page_cleaner_flush_pages_recommendation ) and signals them via os_event_set . Each instance has its own page_cleaner_slot_t tracking flush state. Workers first flush the LRU list ( buf_do_LRU_batch ) and then the Flush List ( buf_do_flush_list_batch ). Only one thread may batch‑flush a given list at a time; concurrent attempts are rejected. Coordinator sleep conditions include active database workload or when the next scheduled flush time has not arrived.
buf_do_LRU_batch walks the LRU list backwards, calling buf_flush_ready_for_replace . If a page satisfies (bpage->oldest_modification == 0 && bpage->buf_fix_count == 0 && buf_page_get_io_fix(bpage) == BUF_IO_NONE) , it can be freed; otherwise buf_flush_ready_for_flush determines if it can be flushed, and buf_flush_page_and_try_neighbors performs the flush.
return (bpage->oldest_modification == 0
        && bpage->buf_fix_count == 0
        && buf_page_get_io_fix(bpage) == BUF_IO_NONE);

Explanation: oldest_modification == 0 means the block has not been modified; buf_fix_count == 0 indicates no active operations; buf_page_get_io_fix(bpage) == BUF_IO_NONE means the page has no I/O in flight.
if (bpage->oldest_modification == 0
    || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
    return false;
}

Here, a non-zero oldest_modification means the page is dirty and needs flushing; a non-none I/O fix means the page is currently in use and cannot be flushed.
4. Limitations of the Official Page‑Cleaner and Percona Improvements
The default page‑cleaner threads have two main issues:
The LRU List flushing occurs before the Flush List flushing, and the two operations are mutually exclusive; while the Flush List is being flushed, the LRU List cannot proceed.
The coordinator thread waits for all page‑cleaner threads to finish before handling new flush requests, causing delays when a particular buffer‑pool instance is hot.
Percona's enhancements address these problems:
Decouple LRU List and Flush List flushing, allowing both lists to be flushed in parallel.
Remove synchronization constraints between buffer‑pool instances; each instance independently decides when to flush.
Tencent Database Technology Team supports internal services such as QQ Space, WeChat Red Packets, Tencent Ads, Tencent Music, and Tencent News, and external products on Tencent Cloud like CynosDB, CDB, CTSDB, and CMongo. The team focuses on kernel and architecture optimization to improve performance and stability for both internal services and cloud customers.