
Deep Learning Approaches for Chinese Word Segmentation: BiLSTM‑CRF and BERT

This article reviews modern deep‑learning methods for Chinese word segmentation, comparing traditional CRF‑based approaches with BiLSTM‑CRF and BERT models, describing their architectures, training procedures, experimental results, and practical considerations for deployment.

58 Tech

Building on a previous overview of Chinese word segmentation techniques, this article introduces current deep‑learning methods that treat segmentation as a sequence‑labeling task, contrasting traditional feature‑engineering approaches (e.g., CRF with manually designed templates) with representation‑learning models that automatically learn features.
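Treating segmentation as sequence labeling means assigning each character a boundary tag. A minimal sketch of the label encoding, assuming the common BMES scheme (B = word-initial, M = word-internal, E = word-final, S = single-character word; the article does not name its exact tag set):

```python
def words_to_bmes(words):
    """Convert a segmented sentence (list of words) to per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                  # single-character word
        else:
            tags.append("B")                  # word-initial character
            tags.extend("M" * (len(w) - 2))   # word-internal characters
            tags.append("E")                  # word-final character
    return tags

def bmes_to_words(chars, tags):
    """Recover the word sequence from characters and their predicted tags."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):                   # word boundary after this character
            words.append(buf)
            buf = ""
    if buf:                                   # tolerate a dangling B/M prediction
        words.append(buf)
    return words
```

Under this encoding, segmentation reduces to predicting one of four labels per character, which is exactly what the models below output.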

The neural‑network background is explained: neurons, layers, and how deeper networks can capture hierarchical features, while noting the trade‑offs of model complexity, training efficiency, and over‑fitting, especially in the era of abundant cloud computing resources.

The article then focuses on the BiLSTM‑CRF segmentation method. It first clarifies the relationships among RNN, LSTM, and BiLSTM, illustrating how BiLSTM captures both past and future context. The model architecture consists of word embeddings (e.g., Word2Vec or GloVe), a stacked BiLSTM layer with configurable parameters (num_steps, num_layers, hidden_size), a softmax output, and a CRF layer that resolves label‑sequence inconsistencies by modeling transition probabilities.
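The key property of the BiLSTM layer is that each time step's output concatenates a forward and a backward hidden state. A shape-level numpy sketch, using a simplified tanh RNN cell as a stand-in for the LSTM cell (gates omitted; dimensions are illustrative, not the article's configuration):

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, reverse=False):
    """One-directional tanh RNN pass; a simplified stand-in for an LSTM."""
    steps = inputs[::-1] if reverse else inputs
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x in steps:
        h = np.tanh(W_x @ x + W_h @ h)        # recurrent state update
        outputs.append(h)
    # re-align backward outputs with the original time order
    return outputs[::-1] if reverse else outputs

def bilstm_layer(inputs, params_fw, params_bw):
    """Concatenate forward and backward hidden states per time step,
    so each position sees both past and future context."""
    fw = simple_rnn(inputs, *params_fw)
    bw = simple_rnn(inputs, *params_bw, reverse=True)
    return [np.concatenate([f, b]) for f, b in zip(fw, bw)]

rng = np.random.default_rng(0)
embed_dim, hidden_size, num_steps = 8, 4, 5
xs = [rng.standard_normal(embed_dim) for _ in range(num_steps)]
params = lambda: (rng.standard_normal((hidden_size, embed_dim)),
                  rng.standard_normal((hidden_size, hidden_size)))
out = bilstm_layer(xs, params(), params())
# each of the num_steps positions now carries a 2 * hidden_size vector
```

The doubled output dimension is what the subsequent softmax/CRF layers consume; stacking (num_layers > 1) simply feeds these concatenated vectors into another BiLSTM layer.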

Mathematical formulas for CRF scoring, normalization, the loss function, and Viterbi decoding are presented, followed by implementation details using TensorFlow. The training data comprise 300k manually annotated sentences from the 58.com domain and a small portion of the aligned People's Daily corpus. Key preprocessing steps include padding/truncating sequences, batch formation, and the choice between external and learned word embeddings.
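Viterbi decoding finds the highest-scoring tag path under the network's per-step emission scores plus the CRF's transition scores. A minimal dictionary-based sketch (the toy scores below are illustrative, not the article's trained parameters):

```python
def viterbi(emissions, transitions):
    """Return the best tag path.
    emissions: list of {tag: score} per time step (from the network).
    transitions: {(prev_tag, tag): score} from the CRF layer."""
    tags = list(emissions[0])
    # best path score ending in each tag at step 0
    score = {t: emissions[0][t] for t in tags}
    back = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for t in tags:
            # pick the best previous tag for each current tag
            prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[prev] + transitions[(prev, t)] + em[t]
            pointers[t] = prev
        score, back = new_score, back + [pointers]
    # backtrack from the best final tag
    best = max(tags, key=score.get)
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    return path[::-1]
```

With a transition table that heavily penalizes illegal pairs such as B→B, the decoder picks a legal sequence even when the raw emissions prefer an inconsistent one, which is how the CRF layer resolves label-sequence inconsistencies.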

Experimental results show that BiLSTM‑CRF yields only marginal improvements over pure CRF (accuracy and recall around 95%) while incurring 3–20× higher inference latency; increasing model depth does not consistently boost performance when training data are limited.

The article then describes a BERT‑based segmentation approach. After a brief recap of BERT’s pre‑training‑then‑fine‑tune paradigm, it outlines BERT’s multi‑head self‑attention encoder, the addition of CLS and SEP tokens, and the two pre‑training objectives (masked language modeling and next‑sentence prediction). For segmentation, the fine‑tuned model replaces the classification head with a token‑level softmax and optionally a CRF layer.
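Replacing the classification head means projecting every token's final hidden state, rather than only the pooled [CLS] vector, to label logits. A shape-level numpy sketch of the token-level softmax (the label count and dimensions are illustrative):

```python
import numpy as np

def token_softmax_head(hidden_states, W, b):
    """Project each token's hidden state to label probabilities.
    hidden_states: (seq_len, hidden_dim) from the BERT encoder.
    W: (hidden_dim, num_labels), b: (num_labels,)."""
    logits = hidden_states @ W + b                  # (seq_len, num_labels)
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=-1, keepdims=True)   # softmax per token
    return probs, probs.argmax(axis=-1)             # label id per token

rng = np.random.default_rng(1)
seq_len, hidden_dim, num_labels = 6, 16, 4          # e.g. a B/M/E/S label set
h = rng.standard_normal((seq_len, hidden_dim))
probs, pred = token_softmax_head(
    h, rng.standard_normal((hidden_dim, num_labels)), np.zeros(num_labels))
```

When the optional CRF layer is added, these per-token scores serve as the emission scores for Viterbi decoding instead of being argmaxed independently.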

Training details include downloading the Chinese BERT‑base checkpoint, adapting the open‑source TensorFlow code (run_classifier.py) for sequence labeling, and handling input processing (single‑sentence tokenization, label mapping). Experiments on 260k training sentences and 40k test sentences demonstrate that BERT achieves 96–97% accuracy/recall, outperforming CRF by 1–2% overall and by more than 4% when labeled data are scarce. Inference latency is about 250 ms on CPU and 10 ms on a P40 GPU, supporting up to ~400 QPS.
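Adapting run_classifier.py for labeling mostly comes down to aligning one label per character and padding everything to a fixed length. A hedged sketch of that preprocessing (the label names, the reserved pad id, and max_seq_length are assumptions, not the article's exact code):

```python
def build_features(chars, labels, label_map, max_seq_length):
    """Add [CLS]/[SEP], map labels to ids, and pad to max_seq_length.
    Special and padding positions get the reserved id 0 ("[PAD]")."""
    chars = chars[: max_seq_length - 2]      # leave room for [CLS] and [SEP]
    labels = labels[: max_seq_length - 2]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    label_ids = [0] + [label_map[l] for l in labels] + [0]
    input_mask = [1] * len(tokens)           # 1 = real token, 0 = padding
    pad = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * pad
    label_ids += [0] * pad
    input_mask += [0] * pad
    return tokens, label_ids, input_mask

label_map = {"[PAD]": 0, "B": 1, "M": 2, "E": 3, "S": 4}
tokens, label_ids, mask = build_features(
    list("我很好"), ["S", "B", "E"], label_map, max_seq_length=8)
```

The input mask lets the loss and metrics ignore the [CLS]/[SEP] and padding positions so only real characters are scored.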

In conclusion, while deep‑learning models are broadly applicable across NLP tasks, their benefits for word segmentation are modest; BERT’s strong pre‑training gives it a clear edge, especially in low‑resource settings, and the same architecture can be extended to text classification and similarity tasks.

Tags: Deep Learning, Chinese Word Segmentation, NLP, BERT, CRF, BiLSTM
Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
