
Sensitive Word Matching Algorithms in Vivo's Content Review System

Vivo’s Diting content‑review platform employs an Aho‑Corasick automaton to perform fast multi‑pattern matching, extending it with combination‑word and pinyin‑based modes that detect slang, homophones, and multi‑keyword violations, thereby boosting precision and lowering miss rates across its four‑stage moderation pipeline.

vivo Internet Technology

Dec 1, 2021


The Diting system is Vivo's content‑review platform that safeguards the continuous and healthy development of its internet products. It primarily reviews text and follows a four‑stage workflow: whitelist matching, sensitive‑word matching, AI machine review, and manual review. Text must pass the first three stages sequentially; if any stage flags the content as suspicious, it proceeds to manual review, otherwise a definitive result is returned.

Sensitive‑word matching is fast (about 50 ms) and filters out spam text early in the pipeline. However, users constantly invent new slang, homophones, and obfuscations (e.g., “啋票” or “采漂” to evade the keyword “彩票”), which makes it hard for simple pattern‑matching algorithms to achieve both high precision and a low miss rate.

This article examines algorithm selection for the Diting system and presents two practical scenarios that improve precision and reduce miss rates.

Algorithm Selection

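Sensitive‑word detection is a multi‑pattern matching problem. Mature algorithms include the Aho‑Corasick (AC) automaton and the Wu‑Manber (WM) algorithm. Because the word library contains millions of entries and is reloaded infrequently, the system chooses the AC automaton despite its higher memory consumption: its matching time does not grow with the number of patterns, and the cost of building the automaton is paid only when the library is loaded.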

AC Automaton Overview

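The AC algorithm builds a Trie of all pattern strings and augments each node with a fail pointer. A node's fail pointer leads to the node for the longest proper suffix of the string matched so far that is also a prefix of some pattern. When a character mismatch occurs, the search follows the fail pointer instead of restarting from the root, so the text is scanned once, in time linear in its length plus the number of matches, regardless of how many patterns are loaded. For example, in the Trie for the patterns {"she", "he", "shers", "his", "era"}, the node reached by "she" has a fail pointer to the node reached by "he".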

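During matching, the algorithm traverses the target string character by character, following Trie edges; on failure it follows the fail pointer. When the traversal reaches a node marked as the end of a pattern, a match is reported. Patterns that end at nodes reachable through the fail pointers are reported as well, so a text containing "she" also yields a match for "he".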
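The build and match phases described above can be sketched as follows. This is a minimal illustration using the article's example patterns, not Diting's actual implementation; the class name `ACAutomaton` and its table layout are assumptions made for this sketch.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton over characters (illustrative sketch)."""

    def __init__(self, patterns):
        # Parallel tables indexed by node id; node 0 is the root.
        self.goto = [{}]   # outgoing Trie edges
        self.fail = [0]    # fail pointers
        self.out = [[]]    # patterns recognised when this node is reached
        for p in patterns:
            self._insert(p)
        self._build_fail()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pattern)

    def _build_fail(self):
        # Breadth-first order guarantees that a node's fail target
        # (always a shallower node) is complete before it is used.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the fail target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        hits = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for p in self.out[node]:
                hits.append((i - len(p) + 1, p))
        return hits


ac = ACAutomaton(["she", "he", "shers", "his", "era"])
print(ac.search("ushers"))  # [(1, 'she'), (2, 'he'), (1, 'shers')]
```

Copying inherited patterns into each node's output list trades some memory for a simpler search loop; an implementation holding millions of words might instead follow fail links lazily at match time.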

System Practice

The Diting system implements three matching modes on top of the AC automaton: ordinary sensitive words, combination sensitive words, and pinyin‑based sensitive words.

Combination Sensitive Words

Some violations only occur when multiple keywords appear together (e.g., “澳门+博彩+网站”). The system splits each combination into individual words, inserts them into the AC Trie, runs the AC match, and then maps the individual hits back to their combinations. The workflow is:

Split each combination into single words and record the mapping.

Add the split words to the AC Trie.

Run the AC automaton on the text.

Map the match results to the corresponding combinations.

Mark a combination as hit only if all its constituent words are hit.
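The workflow above might look like the following sketch. To keep it short and self-contained, the AC pass is replaced by a naive substring scan (`find_words`), which is a stand-in and not the article's method; the function names and the use of "+" as the separator are likewise assumptions taken from the example notation.

```python
def find_words(text, words):
    """Stand-in for the AC pass: return the set of words that occur in text."""
    return {w for w in words if w in text}


def build_combination_index(combinations, sep="+"):
    """Split each combination and record which combinations use each word."""
    parts = {}           # combination -> set of its constituent words
    word_to_combos = {}  # word -> set of combinations containing it
    for combo in combinations:
        words = {w for w in combo.split(sep) if w}
        parts[combo] = words
        for w in words:
            word_to_combos.setdefault(w, set()).add(combo)
    return parts, word_to_combos


def match_combinations(text, combinations):
    """Return the combinations whose constituent words all occur in text."""
    parts, word_to_combos = build_combination_index(combinations)
    hit_words = find_words(text, word_to_combos.keys())
    # Only combinations sharing at least one hit word need to be checked.
    candidates = set()
    for w in hit_words:
        candidates |= word_to_combos[w]
    # A combination is hit only if every one of its words was hit.
    return sorted(c for c in candidates if parts[c] <= hit_words)


rules = ["澳门+博彩+网站", "博彩+代理"]
print(match_combinations("澳门最大的博彩网站上线了", rules))  # ['澳门+博彩+网站']
print(match_combinations("博彩网站", rules))  # []
```

Because the split words share one automaton, the text is scanned once no matter how many combinations are configured; only the final subset check depends on the number of candidate combinations.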

Pinyin Sensitive Words

To counter homophone evasion (e.g., “啋票” for “彩票”), the system converts Chinese text to pinyin and matches the pinyin strings. Because a character may have multiple pronunciations, the conversion yields a “pinyin graph” (a two‑dimensional array) representing all possible phonetic paths. Matching then proceeds by traversing this graph with depth‑first search (DFS) while simultaneously running the AC automaton on each path.

The key steps are:

Store whole pinyin syllables (strings) on the Trie edges instead of single characters.

Generate all possible pinyin sequences for the input text, forming a graph.

Perform DFS on the graph, feeding each path to the AC automaton.

Apply pruning: skip a branch if the next node has already been visited and its corresponding AC state is unrelated (i.e., branch‑path length B > Trie depth D).
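A rough sketch of these steps follows, under several assumptions: the `READINGS` table is a tiny hand-written stand-in for a real pinyin dictionary, tones are dropped, and patterns are given directly as syllable lists. The pruning shown here memoizes (position, AC state) pairs, which is a simpler rule swapped in for illustration and not necessarily the article's B > D condition.

```python
from collections import deque

# Hypothetical, hand-written reading table for illustration only.
READINGS = {
    "银": ["yin"],
    "行": ["xing", "hang"],  # a character with two readings
    "彩": ["cai"], "采": ["cai"], "啋": ["cai"],
    "票": ["piao"], "漂": ["piao"],
}


def build_automaton(patterns):
    """Build goto/fail/out tables where each Trie edge is a pinyin syllable."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for syl in pat:
            if syl not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][syl] = len(goto) - 1
            node = goto[node][syl]
        out[node].append(tuple(pat))
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for syl, child in goto[node].items():
            f = fail[node]
            while f and syl not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(syl, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def step(automaton, state, syl):
    """Advance the AC state by one syllable, following fail pointers."""
    goto, fail, _ = automaton
    while state and syl not in goto[state]:
        state = fail[state]
    return goto[state].get(syl, 0)


def to_pinyin_graph(text):
    # One list of candidate syllables per character; characters missing
    # from the table are kept as-is with a single "reading".
    return [READINGS.get(ch, [ch]) for ch in text]


def match_pinyin(text, patterns):
    """Return sorted (start_index, syllable_tuple) hits over all readings."""
    automaton = build_automaton(patterns)
    graph = to_pinyin_graph(text)
    hits, seen = set(), set()

    def dfs(pos, state):
        # Prune: the same (position, state) pair always yields the same hits.
        if pos == len(graph) or (pos, state) in seen:
            return
        seen.add((pos, state))
        for syl in graph[pos]:
            nxt = step(automaton, state, syl)
            for pat in automaton[2][nxt]:
                hits.add((pos - len(pat) + 1, pat))
            dfs(pos + 1, nxt)  # recursive; very long texts need an explicit stack

    dfs(0, 0)
    return sorted(hits)


words = [["yin", "hang"], ["cai", "piao"]]
print(match_pinyin("银行采漂", words))  # [(0, ('yin', 'hang')), (2, ('cai', 'piao'))]
```

Carrying the AC state through the DFS means shared prefixes of different phonetic paths are processed once, rather than materializing every path as a separate string.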

Summary and Outlook

The Diting system, built on the AC automaton, now supports ordinary, combination, and pinyin‑based sensitive‑word matching, covering most text‑moderation scenarios and reducing the load on both machine and human reviewers. Over one million sensitive words are currently deployed online, greatly enhancing content security. Future work includes adding logical “NOT” operators for exclusion rules and implementing fuzzy matching to further improve precision and recall.
