Databases 20 min read

Understanding InnoDB MVCC, Undo Logs, ReadView, and Purge Mechanisms

This article explains InnoDB's Multi-Version Concurrency Control (MVCC) architecture, detailing row formats, undo log types, ReadView visibility rules, purge processes, and related concepts such as semi-consistent reads and index condition pushdown, with code examples and diagrams illustrating each mechanism.

Tencent Database Technology
Tencent Database Technology
Tencent Database Technology
Understanding InnoDB MVCC, Undo Logs, ReadView, and Purge Mechanisms

1. Overview

InnoDB supports MVCC (Multi-Version Concurrency Control); the undo log stores multiple versions of records, enabling both transaction rollback and consistent reads. Undo logs are kept in rollback segments and are reclaimed by purge operations.

InnoDB row records contain transaction-related information such as transaction ID and roll_ptr. The ID is used for visibility checks, while roll_ptr points to the undo record for reconstructing previous versions. Consistent reads create a ReadView that captures the state of active transactions and uses it to determine visible records.

MVCC's key feature is that consistent reads do not acquire locks, allowing reads to proceed without blocking updates and improving concurrency.

2. InnoDB Row Format

InnoDB row format (diagram) shows that rows in the clustered index contain DB_TRX_ID and DB_ROLL_PTR fields:

DB_TRX_ID stores the transaction ID (trx->id) for visibility checks.

DB_ROLL_PTR stores the address of the rollback segment (spaceid, pageno, offset) for tracing the previous version.

Example table definition:

create table t1(c1 int primary key, c2 int, c3 char(10), index i_c3(c3));
insert into t1 values(1,1,’a’);

Note: Tables without a primary key automatically receive a DB_ROW_ID column to replace the PK.

3. Undo

Undo logs are stored in rollback segments, which reside in ibdata or a separate undo tablespace.

The main undo record types are:

#define TRX_UNDO_INSERT_REC 11 /* fresh insert into clustered index */
#define TRX_UNDO_UPD_EXIST_REC        \
  12 /* update of a non-delete-marked \
     record */
#define TRX_UNDO_UPD_DEL_REC                \
  13 /* update of a delete marked record to \
     a not delete marked record; also the   \
     fields of the record can change */
#define TRX_UNDO_DEL_MARK_REC              \
  14 /* delete marking of a record; fields \
     do not change */

For inserts and deletes, undo records keep the key values; delete operations are marked. For updates performed in place, undo stores both the key and the old value. If an update is implemented as delete+insert, only the key is recorded, and the delete is a delete‑mark.

Secondary‑index updates are always performed as delete+insert. See trx_undo_report_row_operation for the exact log format.

4. ReadView

Each InnoDB transaction has a corresponding ReadView that records the set of active transactions and is used for visibility decisions.

Visibility is divided into three zones relative to the ReadView's dashed line:

Active transactions intersecting the line are invisible.

Transactions committed before the line are visible.

Transactions not yet started after the line are invisible.

In InnoDB, both read and write transactions receive an incrementing trx_id . The structure trx_sys->rw_trx_ids holds active transaction IDs.

m_ids : list of currently executing transaction IDs (active, invisible).

m_up_limit_id : IDs less than this are committed and visible.

m_low_limit_id : IDs greater or equal to this are not yet started and invisible.

trx_id::id : assigned from trx_sys->max_trx_id for each transaction.

Example ReadView values:

m_ids: (6)
m_up_limit_id: 6
m_low_limit_id: 12

According to the visibility rules, transactions T1, T2, T3, and T5 are visible, while T6 is not.

Clustered‑index visibility check code:

bool changes_visible(trx_id_t id, const table_name_t &name) const
    MY_ATTRIBUTE((warn_unused_result)) {
  ut_ad(id > 0);

  if (id < m_up_limit_id || id == m_creator_trx_id) {
    return (true);
  }

  check_trx_id_sanity(id, name);

  if (id >= m_low_limit_id) {
    return (false);

  } else if (m_ids.empty()) {
    return (true);
  }

  const ids_t::value_type *p = m_ids.data();

  return (!std::binary_search(p, p + m_ids.size(), id));
}
In READ COMMITTED isolation, a new ReadView is created before each statement; in REPEATABLE READ, a single ReadView is created at transaction start.

5. Purge

Purge reclaims undo logs and permanently deletes records that were only delete‑marked.

5.1 Undo Reclamation

Undo stores historical versions of rows; when they are no longer needed, purge removes them.

In InnoDB, trx->no records the commit order of a transaction, obtained from trx_sys->max_trx_id at commit time. In a ReadView, m_low_limit_no is the highest trx->no of committed transactions; undo logs older than this can be purged.

trx_sys->mvcc->m_views holds all current ReadViews; the oldest_view determines the lowest m_low_limit_no that is safe to purge.

Only undo logs with trx->no less than the oldest view's m_low_limit_no are eligible for purge.

After a transaction commits, its undo information (stored in trx_rseg_t ) is placed into the purge_queue , a min‑heap ordered by trx->id .

The purge thread extracts entries from purge_queue that satisfy trx->id < oldest_view->m_low_limit_no and reclaims them.

5.2 Deleting Records

Even after commit, delete‑marked records may remain until purge physically removes them.

Timely purge of undo logs prevents rollback segment growth and improves query performance by eliminating lingering delete‑mark records.

6. Multi‑Version

InnoDB multi‑version data consists of delete‑marked rows together with undo information.

Example: a row with three versions; under REPEATABLE READ, SELECT returns the oldest version (1,1,'a').

Clustered index history is stored in undo logs or as delete‑marked rows; secondary index history is stored as delete‑marked rows.

Illustration of three versions for table t1 :

Latest version T3 (1,5,roll_ptr,1,'c') with transaction ID 5.

Previous version T2 (1,3,roll_ptr,1,'b') reachable via T3's roll_ptr.

Oldest version T1 (1,1,roll_ptr,1,'a') reachable via T2's roll_ptr.

In secondary indexes, the latest version is stored directly, while older versions appear as delete‑marked entries.

7. Visibility Determination

How to obtain the correct version differs between clustered and secondary indexes.

7.1 Clustered Index

With REPEATABLE READ, SELECT * FROM t1 returns the oldest visible version (1,1,'a') based on the ReadView:

ReadView
m_ids: (null)
m_up_limit_id: 2
m_low_limit_id: 2

The engine first reads the newest record (trx_id 5) which is invisible (id > m_low_limit_id). It then follows roll_ptr to previous versions, eventually reaching a record with trx_id 1 that satisfies id < m_up_limit_id, making it visible.

7.2 Secondary Index

Secondary index pages store the maximum transaction ID that touched the page. Visibility is decided by comparing this page ID with the ReadView's m_up_limit_id ; if the page ID is less, the record is visible, otherwise the engine falls back to the clustered index.

Example query using FORCE INDEX(i_c3) where c3 >= 'a' demonstrates this process, ultimately returning the row (1,1,'a').

When the page's max transaction ID equals the view's m_up_limit_id , the record is considered invisible and the engine checks the clustered index for a visible version.

8. ICP and MVCC

Index Condition Pushdown (ICP) is not affected by MVCC; visibility checks are performed before ICP optimization. If a record is visible, ICP proceeds; otherwise the record is ignored.

create table t1(c1 int primary key,c2 int,c3 int, index idx(c2,c3));

select * from t1 force index(idx) where c2>1 and c3=2;

9. Semi‑consistent Read

Semi‑consistent read leverages InnoDB's multi‑version architecture to avoid lock waiting on updates. In READ COMMITTED or when innodb_locks_unsafe_for_binlog is enabled, if an update encounters a row locked by another session, it reads the most recent historical version; if that version satisfies the WHERE clause, the row is re‑read and locked, otherwise it is skipped.

9.1 Analysis

Under REPEATABLE READ, semi‑consistent reads do not occur; a concurrent update will wait. Under READ COMMITTED, semi‑consistent reads allow the second session to proceed, possibly returning zero rows if the historical version does not meet the condition.

If the WHERE condition matches the historical version, the second session will wait for the lock after re‑reading the row.

Implementation details can be found in try_semi_consistent_read / was_semi_consistent_read .

9.2 Issues

Semi‑consistent reads can break serializability in certain scenarios, leading to results that do not correspond to any serial execution order.

select * from t2;
(1,2,), (10,30)

Different interleavings of two sessions produce divergent final states, demonstrating the anomaly.

In REPEATABLE READ, this problem does not occur; therefore REPEATABLE READ is recommended for high‑integrity workloads.

10. Related Bug

Bug #84958 (https://bugs.mysql.com/bug.php?id=84958) described a performance issue when checking visibility of rows scanned via a secondary index. The fix caches clustered‑index cursor records to avoid unnecessary back‑tracking. The fix landed in MySQL 8.0.13 (see commit 0ca968f ).

11. References

https://dev.mysql.com/doc/refman/8.0/en/innodb-multi-versioning.html

https://dev.mysql.com/doc/refman/8.0/en/innodb-consistent-read.html

https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_semi_consistent_read

InnoDBconsistencyundo logMVCCPurgeReadView
Tencent Database Technology
Written by

Tencent Database Technology

Tencent's Database R&D team supports internal services such as WeChat Pay, WeChat Red Packets, Tencent Advertising, and Tencent Music, and provides external support on Tencent Cloud for TencentDB products like CynosDB, CDB, and TDSQL. This public account aims to promote and share professional database knowledge, growing together with database enthusiasts.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.