Chinese Word Segmentation: Challenges, Methods, and Practical Experience
The article explains why Chinese word segmentation is essential for NLP tasks, outlines its fundamental difficulties such as ambiguity and out‑of‑vocabulary words, reviews dictionary‑based, statistical, and CRF approaches, and shares practical experiences from 58 Search’s production system.
Text and language are crucial carriers of information, and understanding them efficiently has long been a focus of research. Since the advent of modern computers, natural language processing (NLP) has emerged as the technology that uses computers to analyze and process text.
Chinese language processing targets Chinese specifically and typically includes lexical analysis (segmentation, part‑of‑speech tagging, entity recognition), syntactic analysis, semantic analysis, and pragmatic analysis. Among these, Chinese word segmentation is the most fundamental technique, serving downstream tasks such as information retrieval, text classification, machine translation, question answering, and summarization.
Why Chinese needs segmentation
Unlike English, where spaces separate words, Chinese characters form a continuous string without explicit delimiters. Segmentation splits this string into meaningful words, which is vital for building inverted indexes in search engines. Without proper segmentation, indexes would be built on single characters, leading to poor precision and massive computational overhead.
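The indexing argument above can be made concrete with a toy inverted index. The sketch below (with hypothetical documents and hand-segmented word lists, since no real segmenter is assumed yet) shows how character-level indexing blows up the term space compared with word-level indexing:

```python
from collections import defaultdict

def build_inverted_index(docs, tokenize):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

docs = ["北京大学", "大学生活"]

# Word-level tokens (hand-segmented here purely for illustration).
segmented = {"北京大学": ["北京", "大学"], "大学生活": ["大学", "生活"]}
word_index = build_inverted_index(docs, lambda t: segmented[t])

# Character-level tokens: every single character gets its own posting list.
char_index = build_inverted_index(docs, list)

print(len(word_index), len(char_index))  # 3 word terms vs. 6 character terms
```

Even on two four-character documents, the character index has twice as many terms, and a one-character query like 大 matches both documents regardless of meaning, which illustrates the precision loss the article describes.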
Basic problems in Chinese segmentation
Segmentation faces three main challenges: segmentation standards, ambiguous splitting, and out‑of‑vocabulary (OOV) words. Ambiguity arises when a character sequence admits multiple valid splits, conventionally divided into overlapping (交集型) and combinatorial (组合型) ambiguities. OOV words include newly coined terms, proper nouns, and domain‑specific terminology, and they dramatically affect segmentation accuracy.
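A tiny enumerator makes the ambiguity problem tangible. Using the textbook example 研究生命 (which reads either as 研究/生命, "study life", or 研究生/命, "graduate student / fate") and a toy dictionary of my own choosing, the sketch lists every dictionary-consistent split:

```python
def all_segmentations(text, vocab, max_len=4):
    """Enumerate every way to split `text` into dictionary words."""
    if not text:
        return [[]]
    results = []
    for i in range(1, min(max_len, len(text)) + 1):
        word = text[:i]
        if word in vocab:
            for rest in all_segmentations(text[i:], vocab, max_len):
                results.append([word] + rest)
    return results

vocab = {"研究", "研究生", "生命", "命"}
print(all_segmentations("研究生命", vocab))
# Both ['研究', '生命'] and ['研究生', '命'] are dictionary-valid;
# the dictionary alone cannot decide between them.
```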
Segmentation methods
Methods can be divided into two broad categories. Dictionary‑based methods (e.g., forward maximum match, reverse maximum match) rely on a word list and are simple but struggle with ambiguity and OOV words. Statistical methods combine a dictionary with language models such as n‑gram, HMM, CRF, or RNN, using probabilities to choose the most likely split.
Dictionary‑based segmentation
This approach scans the text left‑to‑right, matches the longest words in the dictionary, and treats unknown character strings as single‑character words. It is easy to implement and solves most cases, but cannot handle ambiguous splits or OOV words effectively.
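Forward maximum match is short enough to sketch in full. This is a minimal implementation of the greedy longest-match scan described above, with unknown characters falling back to single-character tokens (the dictionary and window size are illustrative assumptions):

```python
def forward_max_match(text, vocab, max_word_len=4):
    """Greedy left-to-right longest match; unknown chars become singletons."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit;
        # a single character always matches as a fallback.
        for j in range(min(max_word_len, len(text) - i), 0, -1):
            word = text[i:i + j]
            if j == 1 or word in vocab:
                tokens.append(word)
                i += j
                break
    return tokens

vocab = {"北京", "北京大学", "大学", "生活"}
print(forward_max_match("北京大学生活", vocab))  # ['北京大学', '生活']
```

Reverse maximum match is the mirror image (scan right-to-left); the two often disagree exactly on ambiguous spans, which is one simple heuristic for detecting ambiguity.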
Dictionary + n‑gram language model
By calculating word frequencies and conditional probabilities between adjacent words (e.g., P(的|说)), the method evaluates the likelihood of each possible segmentation and selects the one with the highest sentence probability under a bigram model.
The process builds a directed acyclic graph of candidate words and uses algorithms like Viterbi to find the highest‑weight path, effectively resolving ambiguities when combined with an OOV detection module.
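The lattice-plus-Viterbi idea can be sketched as a dynamic program over end positions: each dictionary word spanning positions j..i is an edge, scored by a bigram log-probability against the previous word on the best path. The dictionary, probabilities, and unknown-word floor below are all hypothetical, chosen only to show the mechanics:

```python
def segment_bigram(text, vocab, bigram_logp, max_len=4, floor=-12.0):
    """best[i] = (score, tokens) for the best segmentation of text[:i].
    Edges are dictionary words (or single-char fallbacks), scored with
    log P(word | previous word); unseen bigrams get a harsh floor score."""
    best = {0: (0.0, ["<s>"])}
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if (word in vocab or i - j == 1) and j in best:
                prev_score, prev_toks = best[j]
                score = prev_score + bigram_logp.get((prev_toks[-1], word), floor)
                if i not in best or score > best[i][0]:
                    best[i] = (score, prev_toks + [word])
    return best[len(text)][1][1:]  # drop the <s> marker

vocab = {"研究", "研究生", "生命", "命"}
logp = {("<s>", "研究"): -1.0, ("<s>", "研究生"): -1.5,
        ("研究", "生命"): -1.0, ("研究生", "命"): -3.0}
print(segment_bigram("研究生命", vocab, logp))  # ['研究', '生命']
```

With these toy probabilities the path 研究/生命 (score −2.0) beats 研究生/命 (score −4.5), which is exactly the disambiguation the language model buys over pure dictionary matching.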
58 Search’s early segmentation system followed this approach, using reverse maximum match for initial tokenization, then estimating word frequencies and transition probabilities for the language model. OOV words were discovered via large‑scale corpora using frequency, mutual information, and left/right entropy, followed by manual review.
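The frequency / mutual-information / left-right-entropy signals mentioned above can be sketched on a toy corpus. High PMI between a candidate's halves indicates internal cohesion, while high entropy of the neighbouring characters on both sides indicates the candidate occurs in many contexts, i.e. behaves like a free-standing word. This is a simplified character-level illustration, not the production scorer:

```python
import math
from collections import Counter

def candidate_score(corpus, cand):
    """Return (PMI of halves, left-neighbour entropy, right-neighbour entropy)
    for a candidate word over a corpus given as one string."""
    n = len(corpus)
    p_cand = corpus.count(cand) / n
    mid = len(cand) // 2
    p_left = corpus.count(cand[:mid]) / n
    p_right = corpus.count(cand[mid:]) / n
    pmi = math.log(p_cand / (p_left * p_right))  # cohesion of the two halves

    def entropy(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log(c / total) for c in counts.values())

    lefts, rights = Counter(), Counter()
    for i in range(len(corpus) - len(cand) + 1):
        if corpus[i:i + len(cand)] == cand:
            if i > 0:
                lefts[corpus[i - 1]] += 1
            if i + len(cand) < len(corpus):
                rights[corpus[i + len(cand)]] += 1
    return pmi, entropy(lefts), entropy(rights)

# 北京 appears three times with three distinct neighbours on each side.
print(candidate_score("我爱北京我去北京他在北京玩", "北京"))
```

Candidates that clear thresholds on all three statistics become OOV suggestions for the manual review step the article describes.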
Sequence labeling with CRF
The CRF model treats segmentation as a character‑level labeling problem (B/M/E/S tags). It captures rich contextual features through feature templates and uses Viterbi decoding to find the most probable tag sequence, achieving high OOV recall and overall accuracy above 95%.
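Whatever model produces the B/M/E/S sequence, the final step is mechanical: stitch labelled characters back into words. A minimal decoder for that convention (tolerating mildly malformed tag sequences) looks like this:

```python
def bmes_to_words(chars, tags):
    """Recover words from character-level labels:
    B = word begin, M = middle, E = end, S = single-char word."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if buf:               # flush a dangling B/M run (malformed input)
                words.append(buf)
                buf = ""
            words.append(ch)
        elif tag == "B":
            if buf:
                words.append(buf)
            buf = ch
        elif tag == "M":
            buf += ch
        else:                     # "E" closes the current word
            words.append(buf + ch)
            buf = ""
    if buf:
        words.append(buf)
    return words

print(bmes_to_words("我爱北京", ["S", "S", "B", "E"]))  # ['我', '爱', '北京']
```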
Despite its strengths, CRF can produce inconsistent segmentations across different contexts. To mitigate this, 58 Search adds rule‑based modules (e.g., handling punctuation, idioms, URLs, dates) to split text into short fragments before applying the model, balancing consistency and contextual awareness.
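A pre-splitting pass of the kind described can be sketched with two regular expressions: one that protects atoms the model should never cut (URLs, dates), and one that cuts on punctuation. The patterns below are hypothetical stand-ins for 58 Search's actual rule modules:

```python
import re

# Protect URLs and ISO-style dates as indivisible fragments (assumed rules).
PROTECT = re.compile(r"(https?://[^\s，。！？；、]+|\d{4}-\d{2}-\d{2})")
# Cut on common Chinese and ASCII punctuation.
PUNCT = re.compile(r"[，。！？；、,.!?;]")

def pre_split(text):
    """Split text into short fragments for the statistical model."""
    fragments = []
    for piece in PROTECT.split(text):
        if PROTECT.fullmatch(piece):
            fragments.append(piece)   # pass protected atoms through unchanged
        else:
            fragments.extend(p for p in PUNCT.split(piece) if p)
    return fragments

print(pre_split("详情见https://58.com，电话联系。"))
# ['详情见', 'https://58.com', '电话联系']
```

Each resulting fragment is short and punctuation-free, so the CRF sees more uniform inputs and produces more consistent segmentations across contexts.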
Granularity and index expansion
Segmentation granularity (coarse vs. fine) affects recall and precision. 58 Search adopts a finer granularity for indexing and then expands terms via synonym dictionaries, entailment dictionaries, and rule‑based templates (e.g., expanding "第N医院" to "N医院").
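The rule-template expansion can be illustrated with the article's own 第N医院 example. The digit classes and exact pattern below are my assumptions; only the 第N医院 → N医院 rule itself comes from the text:

```python
import re

# Rule template from the article: a term like 第N医院 also matches N医院.
# The numeral classes (arabic digits plus common Chinese numerals) are assumed.
HOSPITAL_RULE = re.compile(r"第([0-9一二三四五六七八九十百]+)(医院)")

def expand_terms(term):
    """Return the original term plus any rule-generated variants."""
    variants = {term}
    m = HOSPITAL_RULE.fullmatch(term)
    if m:
        variants.add(m.group(1) + m.group(2))
    return variants

print(expand_terms("第一医院"))  # {'第一医院', '一医院'}
print(expand_terms("人民医院"))  # {'人民医院'} — no rule fires
```

In production such variants would be merged with synonym- and entailment-dictionary expansions before indexing.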
Corpus collection and annotation
About 60‑70% of NLP effort goes into data collection and preprocessing. 58 Search combines open‑source corpora (e.g., People's Daily, Sogou) with domain‑specific data, focusing annotation on low‑confidence or poorly segmented samples identified through online confidence thresholds or offline statistical analysis.
Conclusion
The article reviews representative segmentation methods—from dictionary‑based to statistical and CRF approaches—highlighting their evolution and practical challenges. It notes the shift toward neural and pre‑trained language models, which can learn richer semantic features without extensive manual feature engineering, and hints at future work in that direction.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.