String Matching Algorithms: Brute‑Force, KMP, Boyer‑Moore, BMH, Sunday, Shift‑And/Or, and Bitmap
This article introduces several classic string‑matching algorithms—including Brute‑Force, KMP, Boyer‑Moore, BMH, Sunday, Shift‑And/Or, and bitmap techniques—explaining their principles, preprocessing steps, and practical considerations for improving text search performance.
When validating forms we often need to check whether a string contains a certain substring, such as an email containing "indexOf". The indexOf method was introduced in ES3.1 and works for both strings and arrays. Interviewers frequently ask how the array version is implemented, but rarely the string version because it is considerably more complex.
The difficulty of implementing indexOf for strings stems from prefix‑suffix relationships, which can be modeled with prefix trees (tries) or suffix trees. Understanding these concepts is essential for writing an efficient implementation.
The simplest approach is the Brute‑Force algorithm (sometimes humorously called the "boyfriend" algorithm). It treats the longer string as the target and the shorter one as the pattern.
The idea is to compare the first character of the target with the first character of the pattern; if they match, continue character‑by‑character comparison, resetting the pattern index if a mismatch occurs.
Brute‑Force Algorithm
KMP Algorithm
The famous Knuth‑Morris‑Pratt (KMP) algorithm can match any pattern against a target in linear time by preprocessing the pattern to build a "next" (or "nextval") table.
For a detailed explanation of KMP’s inner workings, see the linked article.
http://blog.csdn.net/qq_29501587/article/details/52075200
Constructing the next and nextval tables is explained in the summary below.
KMP can be viewed as a degenerated deterministic finite automaton (DFA); it was the first algorithm to break the barrier of slow string matching.
Boyer‑Moore Algorithm
Boyer‑Moore is widely used in text editors for diff operations and is about three to four times faster than KMP. It also preprocesses the pattern to build a bad‑character table and a good‑suffix table.
Ruanyifeng’s article explains the bad‑character rule and the good‑suffix rule, though it does not provide full implementation details.
http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html
The bad‑character rule is relatively easy to implement; the good‑suffix rule can be O(n²) or O(n³) and is harder to understand.
For advanced single‑pattern matching algorithms, the self‑containment of substrings (prefixes/suffixes) is a crucial concept that accelerates matching; KMP popularized this idea, while BM leverages suffix self‑containment. The bad‑character table can speed up any single‑pattern matcher, not only BM; for large character sets, it is stored in a dictionary that maps characters to shift distances.
BMH Algorithm
BMH is a simplification of Boyer‑Moore that discards the complex good‑suffix rule and relies solely on the bad‑character heuristic, achieving better average performance.
Algorithm idea:
Search the text from right to left.
When a mismatch occurs, shift the pattern according to the bad‑character table and continue.
The shift amount is computed by the shift_table shown in the figure below.
Here k is the largest index such that Pattern[k] == Text[i+m-1] .
If no character matches, shift the pattern by its full length m .
If the pattern matches completely, return the position in the text.
If the end of the text is reached without a full match, return -1 .
Sunday Algorithm
Sunday’s method is similar to Boyer‑Moore but, on a mismatch, looks at the character immediately following the current window in the text; if that character does not appear in the pattern, the pattern is shifted by its length plus one.
Shift‑And and Shift‑Or Algorithms
These algorithms are beyond the author’s current expertise; only reference links are provided.
http://www.iteye.com/topic/1130001
Bitmap Algorithm Idea
This technique appeared in a Tencent interview; its advantages and disadvantages are discussed in the linked article.
http://www.tuicool.com/articles/aYfEvy
Additional resources with implementations and complexity analysis are listed below.
http://blog.csdn.net/airfer/article/details/8951802/
For ordinary developers, these algorithms are rarely used on the front end, but they can significantly boost performance in rich‑text editors, log processing, and similar scenarios.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.