Databases 15 min read

Understanding MySQL and Elasticsearch Indexing Mechanisms

This article compares MySQL's B+ tree indexing with Elasticsearch's inverted index, explaining underlying data structures such as hash tables, ordered arrays, balanced binary trees, skip lists, and term dictionaries, and discusses optimization techniques like bitmap intersections for efficient query processing.

Java Captain

Oct 10, 2020

Understanding MySQL and Elasticsearch Indexing Mechanisms

Preface

While maintaining a product's search feature, I became curious about how Elasticsearch can achieve such high query efficiency, even surpassing MySQL primary‑key lookups.

It even feels faster than a local MySQL primary‑key query.

To investigate, I searched for related material and found many online answers summarised as follows:

Elasticsearch is built on Lucene, a full‑text search engine that tokenises data and stores an inverted index, which is better suited for large‑scale search than MySQL, which struggles with frequent updates and joins.

Since the discussion repeatedly mentions indexes, I will compare MySQL and Elasticsearch from the index perspective.

MySQL Index

Indexes are a classic space‑for‑time optimisation in MySQL, typically used to accelerate query scenarios.

The following content uses the InnoDB engine as an example.

Common Data Structures

If we were to design MySQL indexes ourselves, we could choose among several data structures.

Hash Table

Hash tables (e.g., Java's HashMap) provide O(1) lookup for exact keys such as id=3, but cannot efficiently handle range queries like 1 ≤ id ≤ 6 because they are unordered.

Ordered Array

Ordered arrays allow O(log n) lookup via binary search for exact keys (e.g., id=4) and naturally support range queries. However, inserting a value like id=2.5 requires shifting subsequent elements, leading to poor write performance.

Balanced Binary Tree

Balanced binary trees (e.g., AVL or Red‑Black trees) provide O(log n) lookup and insertion. For a query like id=11, the path 10 → 12 → 11 is traversed. However, range queries still require traversing both sub‑trees, which is less efficient.

Skip List

Skip lists, used in Redis sorted sets, augment an ordered linked list with multiple levels of forward pointers, enabling near‑logarithmic search. A query for id=13 might traverse 1 → 7 → 10 → 13, and range queries are supported by scanning from the start node to the end node.

Optimization of the Balanced Tree – B+ Tree

MySQL InnoDB actually uses a B+ tree rather than a plain skip list. In a B+ tree, only leaf nodes store the actual data; internal nodes store only keys as indexes, keeping all leaf nodes sorted. This design enables efficient range queries by locating the start leaf and scanning sequentially.

Because the index file resides on disk, minimizing disk I/O is crucial. The height of the tree directly influences the number of I/O operations; a shorter tree (e.g., a 3‑ary tree) reduces I/O.

This is essentially how a B+ tree originates.

Practical Indexing Tips

To keep the B+ tree efficient, primary keys should be inserted in increasing order. Random insertion may require moving existing leaf nodes, whereas monotonic inserts simply append.

Therefore, auto‑increment primary keys are generally recommended when sharding is not a concern.

Elasticsearch Index

Now let's examine how Elasticsearch (built on Lucene) handles indexing.

Forward Index

Elasticsearch also maintains a forward index (often called a doc‑ID lookup) that maps a document ID to the stored fields, similar to a hash table.

Essentially, it is a key‑to‑value lookup.

Inverted Index

To find all documents where a field (e.g., name) contains a term (e.g., li), Elasticsearch builds an inverted index. The term points to a posting list of document IDs that contain the term.

Term Dictionary

The term dictionary stores all unique terms in sorted order, enabling binary‑search (O(log n)) lookup of a term.

Tokenisation (splitting text into terms) creates the dictionary; for large corpora the dictionary may be too big for memory, so it is stored on disk with auxiliary indexes.

English tokenisation is simple (split on whitespace/punctuation); Chinese requires more sophisticated algorithms.

Term Index (Trie)

A trie (dictionary tree) can index the term dictionary, allowing fast in‑memory lookup of term prefixes.

When searching for terms starting with j, the term index yields the range of positions in the term dictionary, after which a binary search locates the exact term and its posting list.

Further Optimisations

Elasticsearch can intersect posting lists using bitmaps. For a query like name=li AND age=18, each field yields a bitmap; a bitwise AND quickly produces the final result set.

Bitmaps also reduce storage space and I/O compared to scanning two posting lists separately.

Newer ES versions further compress posting lists; the exact compression algorithms are documented in the official ES manual.

Conclusion

Complex products ultimately rely on fundamental data structures that are tuned for specific scenarios. Mastering these basics—hash tables, ordered arrays, balanced trees, skip lists, B+ trees, and inverted indexes—enables rapid adoption and optimisation of new technologies such as Elasticsearch.

My next step is to implement a single‑node search engine following ES's inverted‑index design to deepen my understanding.

Your likes and views are the greatest support for me!

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

mysql B+Tree DataStructures InvertedIndex

Written by

Java Captain

Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.