Sensitive Word Matching in Vivo's Content Review System: Algorithm Selection and Practical Implementations
The article describes how Vivo's content moderation platform, DiTing, detects sensitive terms in large-scale text streams. It covers the choice of the Aho-Corasick (AC) automaton as the core matching algorithm and two extensions built on it, combination word matching and pinyin-based matching, which address challenges such as homophones, multi-character patterns, and performance constraints.
Vivo's DiTing system is a content‑review platform that safeguards the company's internet products by processing text through a four‑stage pipeline: whitelist matching, sensitive‑word matching, AI machine review, and manual review. The primary focus is on fast, accurate detection of sensitive words.
The sensitive-word matching component relies on multi-pattern matching algorithms. After evaluating the AC automaton and the Wu-Manber (WM) algorithm, the team chose the AC automaton: although it uses more memory and takes longer to load, it handles million-scale dictionaries well and fits within the available server resources.
The AC automaton is explained through its two core concepts, the Trie (dictionary tree) and fail pointers, illustrated by a sample pattern set {"she", "he", "shers", "his", "era"}. The algorithm builds a Trie from the patterns, adds a fail pointer to each node, and then scans the target string once. On a mismatch it follows fail pointers to the longest suffix that is still a valid Trie prefix, so the text itself is never rescanned.
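The build-and-scan procedure can be sketched in Python as follows. This is a minimal illustration using the sample pattern set above, not DiTing's production code; the class name `ACAutomaton` and its method names are assumptions made for this example.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick sketch: Trie + fail pointers (illustrative only)."""

    def __init__(self, patterns):
        # Node i is described by children[i], fail[i], and output[i].
        self.children = [{}]
        self.fail = [0]
        self.output = [[]]
        for pattern in patterns:
            self._insert(pattern)
        self._build_fail_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.children[node].get(ch)
            if nxt is None:
                nxt = len(self.children)
                self.children.append({})
                self.fail.append(0)
                self.output.append([])
                self.children[node][ch] = nxt
            node = nxt
        self.output[node].append(pattern)

    def _build_fail_links(self):
        # Breadth-first: depth-1 nodes keep fail = root (index 0).
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit patterns that end at the fail target (e.g. "he" inside "she").
                self.output[child] = self.output[child] + self.output[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        hits = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for pattern in self.output[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits


ac = ACAutomaton(["she", "he", "shers", "his", "era"])
print(ac.search("ushers"))  # [(1, 'she'), (2, 'he'), (1, 'shers')]
```

Merging the fail target's outputs into each node at build time means the scan loop never has to walk the fail chain just to report shorter patterns, at the cost of some extra memory per node.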
Combination Sensitive Words: Some illegal content is detectable only when several keywords appear together, e.g., "澳门" (Macau), "博彩" (gambling), and "网站" (website). The solution splits each combination into individual words, inserts them into the AC Trie, runs the automaton, and then maps the matched words back to their combinations. The workflow is:
Split combinations into single words and record the mapping.
Add the split words to the AC Trie.
Run the AC automaton on the text.
Map matched words to their original combinations.
Determine which combinations are fully hit.
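The five steps above can be sketched as follows. To keep the example short and self-contained, the hypothetical helper `match_words` uses a plain substring check as a stand-in for the AC automaton run; the function names and the combination ID are assumptions for illustration, not names from DiTing.

```python
def build_word_index(combinations):
    """Steps 1-2: split combinations into words and record word -> combination IDs."""
    word_to_combos = {}
    for combo_id, words in combinations.items():
        for word in words:
            word_to_combos.setdefault(word, set()).add(combo_id)
    return word_to_combos


def match_words(text, words):
    """Step 3 stand-in: a real system would run the AC automaton here."""
    return {word for word in words if word in text}


def hit_combinations(text, combinations):
    """Steps 4-5: map matched words back and keep only fully hit combinations."""
    index = build_word_index(combinations)
    matched = match_words(text, index.keys())
    candidates = set()
    for word in matched:
        candidates |= index[word]
    # A combination is hit only if every one of its words was matched.
    return {c for c in candidates if set(combinations[c]) <= matched}


combos = {"macau_gambling_site": ("澳门", "博彩", "网站")}
print(hit_combinations("澳门博彩官方网站", combos))  # {'macau_gambling_site'}
print(hit_combinations("澳门旅游网站", combos))      # set()
```

Because the word-to-combination mapping is kept outside the Trie, a word shared by many combinations is stored and matched only once.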
Pinyin Sensitive Words: To catch obfuscations that use homophones or similar-sounding characters (e.g., "啋票" for "彩票", lottery ticket), the text is converted to pinyin and matched against pinyin-based patterns. Because a polyphonic character (多音字) has more than one pronunciation, the system enumerates all possible pinyin sequences, forming a two-dimensional "pinyin graph". Matching then combines depth-first search (DFS) on the graph with the AC automaton, which requires dedicated termination and pruning strategies.
Termination occurs when every node in the pinyin graph has been visited and the AC automaton reports a failure. Pruning eliminates redundant paths by checking whether a node has already been traversed and whether the current Trie depth D is less than the branch length B of the pinyin graph.
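A rough sketch of DFS over a pinyin graph driven by an AC automaton is shown below. It makes several assumptions that are not from the article: the `PINYIN` table is a tiny hypothetical stand-in for a full pinyin dictionary, patterns are tuples of syllables, and pruning is done by remembering visited (column, automaton state) pairs, which is a simpler rule than the depth D versus branch length B check described above.

```python
from collections import deque

# Hypothetical toy table; "行" is polyphonic (xing / hang).
PINYIN = {"买": ["mai"], "啋": ["cai"], "彩": ["cai"], "票": ["piao"],
          "银": ["yin"], "行": ["xing", "hang"]}


def step(goto, fail, state, syllable):
    """Follow fail links until a transition on syllable exists (or root is reached)."""
    while state and syllable not in goto[state]:
        state = fail[state]
    return goto[state].get(syllable, 0)


def build_automaton(patterns):
    """AC automaton whose alphabet is pinyin syllables instead of characters."""
    goto, fail, out = [{}], [0], [set()]
    for pattern in patterns:
        node = 0
        for syllable in pattern:
            if syllable not in goto[node]:
                goto[node][syllable] = len(goto)
                goto.append({})
                fail.append(0)
                out.append(set())
            node = goto[node][syllable]
        out[node].add(pattern)
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for syllable, child in goto[node].items():
            fail[child] = step(goto, fail, fail[node], syllable)
            out[child] |= out[fail[child]]
            queue.append(child)
    return goto, fail, out


def match_pinyin(text, patterns):
    goto, fail, out = build_automaton(patterns)
    # Pinyin graph: one column per character, one entry per possible reading.
    graph = [PINYIN.get(ch, [ch]) for ch in text]
    hits, seen = set(), set()
    stack = [(0, 0)]  # (column index, automaton state)
    while stack:  # terminates when no unexplored branch remains
        col, state = stack.pop()
        if col == len(graph) or (col, state) in seen:
            continue  # prune: end of text, or this branch was already explored
        seen.add((col, state))
        for syllable in graph[col]:
            nxt = step(goto, fail, state, syllable)
            hits |= out[nxt]
            stack.append((col + 1, nxt))
    return hits


patterns = {("cai", "piao"), ("yin", "hang")}
print(match_pinyin("买啋票", patterns))  # {('cai', 'piao')}
print(match_pinyin("银行", patterns))    # {('yin', 'hang')}
```

Two paths that reach the same column in the same automaton state can only produce the same future matches, so exploring one of them is enough; this keeps the search from growing exponentially with the number of polyphonic characters.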
In summary, DiTing integrates ordinary, combination, and pinyin‑based sensitive‑word matching using the AC automaton, achieving high coverage for text moderation, reducing manual workload, and allowing rapid policy updates. The system currently manages over one million sensitive terms and plans future enhancements such as logical negation and fuzzy matching to further improve precision and recall.
Architecture Digest