Understanding MySQL Indexes: Data Structures, Algorithms, and Optimization Strategies
This article examines MySQL indexing by exploring underlying data structures such as B‑Tree and B+Tree, common search algorithms, storage engine differences, and practical optimization techniques like composite and prefix indexes, providing a comprehensive guide for designing efficient database indexes.
Abstract This article focuses on MySQL databases and discusses index‑related topics, concentrating on B‑Tree indexes as the most commonly used structure in MySQL.
Common Query Algorithms and Data Structures Indexes are essentially data structures that enable efficient search algorithms. The article reviews linear (sequential) search, binary search, binary‑tree search, hash tables, and block (indexed sequential) search, describing their applicable data structures and time complexities.
2.1 The Essence of an Index MySQL defines an index as a data structure that helps MySQL retrieve data efficiently; in other words, an index is a data structure.
2.2 Common Search Algorithms
2.2.1 Linear Search
Data structure: ordered or unordered queue
Complexity: O(n)
Example code:
// sequential search
int SequenceSearch(int a[], int value, int n)
{
int i;
for(i=0; i
if(a[i]==value)
return i;
return -1;
}2.2.2 Binary Search
Data structure: ordered array
Complexity: O(log n)
Example code (recursive version):
// binary search, recursive version
int BinarySearch2(int a[], int value, int low, int high)
{
int mid = low + (high - low) / 2;
if(a[mid]==value) return mid;
if(a[mid]>value) return BinarySearch2(a, value, low, mid-1);
if(a[mid]
}2.2.3 Binary‑Tree Search
Data structure: binary search tree
Complexity: O(log₂ N)
Search principle described in bullet points (empty tree → failure; compare with root → left/right subtree).
2.2.4 Hash Table
Data structure: hash table
Complexity: ~O(1) depending on collisions
2.2.5 Block Search (Indexed Sequential Search)
Data is divided into blocks; each block has a maximum key forming an index table. Search first the index (binary or sequential) to locate the block, then perform sequential search within the block.
2.3 Balanced Multi‑Way Search Trees (B‑Tree)
B‑Tree is a balanced multi‑way search tree. Each node contains between ⌈d/2⌉ and d‑1 keys (where d is the node degree) and the keys are ordered. All leaf nodes reside at the same depth.
Search algorithm (pseudo‑code):
BTree_Search(node, key) {
if(node == null) return null;
foreach(node.key) {
if(node.key[i] == key) return node.data[i];
if(node.key[i] > key) return BTree_Search(point[i]->node);
}
return BTree_Search(point[i+1]->node);
}
data = BTree_Search(root, my_key)B‑Tree properties such as height h ≤ log_d((N+1)/2) give search I/O complexity O(log_d N). Insert/delete operations require node split/merge to maintain balance.
2.3.2 B+Tree
B+Tree differs from B‑Tree in that internal nodes store only keys (no data), while leaf nodes store the full records. This allows a larger fan‑out and better disk‑I/O performance. A variant with sequential access pointers adds a pointer from each leaf to its right‑hand neighbor, improving range queries.
3. Computer‑Architecture Principles Behind Indexes
Indexes reside on both main memory (RAM) and external storage (disk). Because disk I/O involves mechanical movement (seek time, rotational latency, transfer time), minimizing the number of disk accesses is crucial. Modern implementations align each tree node with a disk page so that one I/O loads an entire node.
4. Performance Analysis of B‑/B+Tree Indexes
For B‑Tree, a search touches at most h‑1 nodes (root often cached), each requiring one I/O, yielding O(log_d N) I/O cost. With typical fan‑out d > 100, height h is usually ≤ 3, so most lookups need only 2–3 disk reads.
For B+Tree, the larger fan‑out (because internal nodes lack data) further reduces height and I/O. The maximum fan‑out is calculated as d_max = floor(pageSize / (keySize + dataSize + pointerSize)).
5. MySQL Index Implementations
5.1 MyISAM
MyISAM uses B+Tree indexes; leaf nodes store the address of the data record. Primary and secondary indexes share the same structure, differing only in uniqueness constraints.
5.2 InnoDB
InnoDB also uses B+Tree, but the table’s data file itself is a clustered B+Tree where leaf nodes contain the full row data. Primary keys become the clustered index. Secondary indexes store the primary key value (not the row address), requiring a second lookup to fetch the full row.
Understanding these differences helps choose appropriate primary keys (e.g., avoid long or non‑monotonic keys) and design efficient secondary indexes.
6. Index Usage Strategies and Optimization
6.1 Composite Indexes and the Left‑most Prefix Rule
A composite (multi‑column) index orders rows by the first column, then the second, etc. Queries can use the index if they filter on a left‑most prefix of the index columns (e.g., a, a+b, a+b+c). Order of conditions does not matter; the optimizer will reorder them.
Example SQL snippets are provided to illustrate usable and unusable cases.
6.2 Prefix Indexes
For long string columns, a prefix index stores only the first N characters, reducing index size while retaining selectivity when the prefix is sufficiently discriminative.
Guidelines include using prefix indexes on VARCHAR/TEXT columns with high prefix selectivity, avoiding them on columns where the prefix is not selective, and noting that prefix indexes cannot be used for ORDER BY, GROUP BY, or covering indexes.
6.3 General Optimization Tips
Follow the left‑most prefix principle.
Index foreign keys.
Index columns used in WHERE, ON, GROUP BY, ORDER BY.
Prefer high‑cardinality columns.
Prefer short columns for indexes.
Avoid functions on indexed columns in predicates.
Use prefix indexes for long strings.
Reuse existing indexes instead of creating many new ones.
Balance the number of indexes against DML overhead.
For LIKE queries, avoid leading wildcards.
Ensure data type compatibility in predicates.
Sample queries demonstrate index usage versus non‑usage (e.g., LIKE 'prefix%' vs. LIKE '%substring%').
References to several online articles are listed at the end of the original source.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.