Optimizing a Real‑Time Keyword Matching Service with Aho‑Corasick and Double‑Array Trie
By replacing the naïve double‑loop matcher with a Double‑Array Trie‑based Aho‑Corasick automaton and refactoring the system into a layered name‑and‑data microservice architecture that shards the keyword dictionary and rebuilds the automaton only on version changes, the real‑time keyword‑matching service reduced latency from seconds to milliseconds even at thousands of QPS.
The discussion begins with a newcomer, Xiao Cai, inheriting a real‑time keyword‑matching service that processes long texts, matches them against a keyword dictionary, and returns matched keywords together with product‑specific strategies. The service is critical for downstream video upload and content‑review pipelines.
When the query per second (QPS) is low, the service meets latency requirements, but at high QPS (hundreds of QPS) the response time degrades to several seconds, causing user dissatisfaction. The current implementation uses a naïve double‑loop algorithm:
for (input : inputs) {
for (keyword : keywords) {
if (input.contains(keyword.name) && keyword.products.contains(input.product)) {
// match found
} else {
// no match
}
}
}The keyword data is stored in a MySQL table:
CREATE TABLE `keyword` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`type` int(5) NOT NULL COMMENT '关键词策略',
`name` varchar(100) NOT NULL COMMENT '关键词内容',
`description` varchar(1000) COMMENT '关键词描述',
`products` varchar(1200) COMMENT '适用产品,多个产品,分隔',
`status` int(11) DEFAULT NULL
);Because the dictionary contains millions of entries, the O(m·n) time complexity (m = text length, n = number of keywords) becomes a bottleneck. The team proposes upgrading the algorithm to the Aho‑Corasick (AC) automaton, which combines a Trie (prefix tree) with failure links derived from the KMP algorithm, achieving linear O(L) matching where L is the total length of the input text, independent of the number of patterns.
The AC construction consists of three tables:
goto : the state transition table, essentially the Trie structure.
failure : links used when a transition fails, pointing to the longest proper suffix that is also a prefix.
output : the list of keywords matched at each state.
Example output table:
i
Output(i)
2
he
5
she,he
7
his
9
hers
Example failure table:
i
Fail(i)
0
0
1
0
2
0
3
0
4
1
5
2
6
0
7
3
8
0
9
3
To further reduce memory consumption, the team explores a Double‑Array Trie (DAT) representation, which stores the Trie in two linear arrays, eliminating pointer overhead while preserving O(L) matching speed.
Beyond algorithmic changes, the service architecture is refactored into two layers:
name service: handles load balancing, service discovery, and routing.
data service: performs the actual keyword matching. Multiple data instances (data1, data2, …) are deployed, each holding a shard of the keyword dictionary (e.g., 500 k keywords per instance) to support scaling to tens of millions of entries.
Service discovery is achieved via a registration center. The name service discovers all data services whose names share the prefix keyword.check.act. , forwards requests to them, aggregates results, and returns the combined matches.
To avoid unnecessary cache refreshes, a version‑controller component monitors the keyword table’s primary‑key list. Only when the list changes does it rebuild the AC automaton; otherwise the existing automaton continues serving requests, eliminating the waste of refreshing a static cache every minute.
Performance measurements after the upgrades show dramatic improvements:
For matched queries, latency dropped from ~4 seconds at 50 QPS to ~2 ms at 1 kQPS.
For unmatched queries, latency improved from ~10 seconds at 20 QPS to ~2 ms at 1 kQPS.
The final summary highlights two key achievements:
Algorithm upgrade: replacing the double‑loop with a Double‑Array Trie‑based AC automaton reduces matching complexity to O(L).
Architecture upgrade: introducing a layered microservice design (name + data) and a version‑controlled AC builder enables horizontal scaling, high availability, and near‑real‑time handling of keyword updates.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.