Databases 34 min read

Master MySQL Indexes: B+Tree, Hash, Clustering & Optimization

This article provides a comprehensive guide to MySQL indexing, covering index types, data structures, clustering and non‑clustering indexes, hash indexes, B+Tree mechanics, index lookup, covering indexes, index push‑down, index merge, cost‑based index selection, common pitfalls, and best practices for designing effective indexes.

Sanyou's Java Diary
Sanyou's Java Diary
Sanyou's Java Diary
Master MySQL Indexes: B+Tree, Hash, Clustering & Optimization

Hello everyone, I am San You~~

Today we will review common MySQL index knowledge points, focusing on the InnoDB storage engine.

Index Classification

Indexes can be classified from different dimensions.

1. By underlying data structure

B+Tree index

Hash index

2. By physical storage layout

Clustered index

Non‑clustered index (secondary index)

Clustered and non‑clustered indexes will be discussed in detail later.

3. By index characteristics

Primary key index

Unique index

Normal index

Full‑text index

4. By number of columns

Single‑column index

Composite (multi‑column) index

Index Data Structures

Preparation

To illustrate the following content, we prepare a user table. All examples use this table.

<code>CREATE TABLE `user` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `age` int(10) DEFAULT NULL,
  `city` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;</code>

Hash Index

Hash indexes are rarely used because InnoDB does not support explicitly creating hash indexes; it only supports adaptive hash indexes.

Even if you declare a hash index in SQL, it is ineffective.

Creating a hash index on the name column and checking with SHOW INDEX FROM table_name still shows a B+Tree.

The Memory engine supports hash indexes.

Hash indexes are similar to Java's HashMap: they store key‑value pairs where the key is the indexed column and the value is a pointer to the row data.

Assume the user table uses the Memory engine and a hash index is created on name . After inserting three rows, the hash index hashes the name values to slots.

When two rows have the same hash value (hash collision), a linked list is formed.

To find name='李四' , compute its hash, locate the slot, traverse the list, retrieve the row pointer, and then fetch the row.

Advantages and disadvantages of hash indexes

Only supports equality comparisons, giving very high query efficiency.

Does not support range queries or sorting because the index entries are unordered.

B+Tree

B+Tree is the most widely used index structure in MySQL. Detailed discussion follows later.

Besides hash and B+Tree, there are full‑text indexes and others, which are not covered here.

Clustered Index

Data Page Storage

Inserted rows are persisted to disk. InnoDB divides data into pages (default 16KB), called data pages.

Rows are sorted by the primary key within a page, forming a singly linked list.

Besides row data, each page stores additional metadata; we will discuss these later.

Single‑page data lookup

To locate a row with id=2 in a page, a naive method is to scan from the beginning of the list.

Scanning many rows per page is inefficient, so MySQL groups rows.

Assume a page holds 12 rows; they are divided into groups of 4 (three groups). The maximum id of each group (4, 8, 12) is stored in a page directory.

When searching for id=6 , a binary search on the page directory narrows the range to the group containing ids 5‑8, then a linear scan within that group finds the target row.

The grouping shown is simplified for understanding.

Multi‑page data lookup

When a page fills up, a new page is created. Each page receives a page number, and pages are linked via a doubly linked list.

The page number and pointers to previous/next pages are stored in extra metadata.

MySQL ensures that the maximum id in a page is less than the minimum id in the next page, guaranteeing global ordering by id .

To find id=5 across multiple pages, MySQL first uses the page‑directory to locate the candidate page, then performs the same group‑based binary search inside that page.

Secondary Index

A secondary (non‑clustered) index is also a B+Tree, but its leaf nodes store only the indexed column(s) and the primary key id , not the full row data.

Although the underlying structure is the same, we refer to the leaf nodes of the primary index as data pages and those of secondary indexes as index pages.

Single‑column index

Creating a normal (non‑unique) index on name makes name the index column.

After inserting three rows, the leaf nodes of the name index contain the values sorted by name . When values are equal, rows are further ordered by id .

MySQL can sort Chinese characters as well, depending on the collation.

Lookup in a single‑column index follows the same grouping and binary‑search logic as the clustered index.

Composite index

A composite index on name and age stores both columns in the leaf nodes. Sorting is first by name , then by age , and finally by id if needed.

The resulting B+Tree has three levels (simplified here).

Summary of primary vs secondary indexes

Clustered index leaf nodes store all column values; secondary index leaf nodes store only indexed columns and the primary key.

Clustered index data is ordered by id ; secondary index data is ordered by the indexed column(s).

Clustered index non‑leaf nodes store id and page numbers; secondary index non‑leaf nodes store indexed column(s), id , and page numbers.

For convenience, the article includes SQL statements that populate the example table (see the end of the article).

Back‑Table (Row Lookup)

When a query uses a secondary index, MySQL first finds the matching id values in the secondary index, then retrieves the full row from the clustered index using those id s. This step is called “back‑table”.

<code>SELECT * FROM `user` WHERE name = '赵六';</code>

The process involves locating the secondary index leaf, obtaining the primary key, and then fetching the remaining columns from the clustered index.

Covering Index

If a query only requests columns that are present in the index (e.g., SELECT id FROM user WHERE name='赵六'; ), the engine can satisfy the query directly from the index without a back‑table lookup. This is a covering index and improves performance.

In practice, avoid SELECT * ; request only needed columns to increase the chance of using a covering index.

Index Push‑Down

With a composite index on name, age , a query like WHERE name > '王五' AND age > 22 can evaluate the age condition directly in the index (push‑down), reducing back‑table calls.

<code>SELECT * FROM `user` WHERE name > '王五' AND age > 22;</code>

Before MySQL 5.6, each matching row required a back‑table lookup. After 5.6, the age condition is evaluated in the index, so only rows that satisfy both conditions trigger a back‑table, dramatically reducing the number of lookups.

This optimization is called “index push‑down”.

Index Merge

MySQL can combine multiple indexes for a single query (introduced in MySQL 5.1). The merge can be an intersect, union, or sort‑union.

Intersect

For WHERE name='赵六' AND age=22 , MySQL retrieves primary keys from both idx_name and idx_age , intersects them, and then back‑tables the resulting rows.

Union

For WHERE name='赵六' OR age=22 , MySQL unions the primary key sets from the two indexes and removes duplicates before back‑table lookup.

Sort‑Union

If the individual index scans return unsorted primary keys, MySQL sorts them before performing the union.

How MySQL Chooses an Index

MySQL estimates the cost of each possible access path using I/O and CPU metrics.

I/O cost: loading a page costs 1.0.

CPU cost: evaluating a row costs 0.2.

Full‑Table Scan Cost

Cost = rows × 0.2 + (number_of_pages) × 1.0, where the number of pages is derived from data_length (bytes) divided by 16 KB.

Secondary Index + Back‑Table Cost

Cost = (number_of_index_ranges × 1.0) + (rows × 0.2) + (rows × 1.0) + (rows × 0.2). The first term accounts for reading index ranges, the second for scanning rows in those ranges, the third for back‑table page reads, and the fourth for CPU evaluation.

MySQL selects the path with the lowest total cost.

Cost calculations are simplified; real‑world scenarios (joins, complex predicates) are more involved.

Common Index Pitfalls

Violating the left‑most prefix rule (e.g., LIKE '%abc' or using a later column of a composite index first) causes the index to be ignored.

Applying functions or expressions to indexed columns prevents index usage.

Implicit type conversion (e.g., comparing a VARCHAR column to a numeric literal) can invalidate the index.

Stale statistics may mislead the optimizer; run ANALYZE TABLE to refresh.

Index Design Principles

Avoid excessive numbers of indexes; each index consumes disk space and adds maintenance overhead.

Index columns that appear frequently in WHERE clauses.

Consider indexing columns used in ORDER BY or GROUP BY when they match the query’s filtering order.

Do not index columns that are updated frequently, as index maintenance becomes costly.

Prefer high‑cardinality columns; low‑cardinality indexes (e.g., gender) may be less selective than a full scan.

Conclusion

This article covered MySQL index fundamentals, including B+Tree and hash structures, clustered vs. secondary indexes, back‑table lookups, covering indexes, index push‑down, index merge, cost‑based optimizer decisions, common reasons for index loss, and practical guidelines for designing effective indexes.

If you found this article helpful, please like, share, and follow for more content.

For any questions, scan the QR code below to contact me on WeChat.

SQL used to create the example data:

<code>INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (1,'李四',20,'杭州');
INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (2,'张三',18,'北京');
INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (3,'张三',23,'上海');
INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (4,'赵六',22,'杭州');
INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (5,'王五',19,'北京');
INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (6,'赵六',24,'上海');
INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (7,'刘七',20,'上海');
INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (8,'刘七',22,'上海');
INSERT INTO `user` (`id`,`name`,`age`,`city`) VALUES (9,'王九',9,'杭州');</code>
performanceOptimizationDatabaseMySQLIndex
Sanyou's Java Diary
Written by

Sanyou's Java Diary

Passionate about technology, though not great at solving problems; eager to share, never tire of learning!

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.