FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based on DAE‑Decoder Paradigm
FASPell is a fast and adaptable Chinese spell checker that pairs a denoising auto‑encoder with a confidence‑character‑similarity decoder. By leveraging unsupervised pre‑training and glyph‑phonetic similarity, it overcomes data scarcity and the rigidity of fixed confusion sets, delivering a simpler architecture, faster inference, and state‑of‑the‑art accuracy for both simplified and traditional Chinese.
This article introduces a paper accepted at EMNLP 2019, titled FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based on DAE‑Decoder Paradigm. The paper's source code is available at https://github.com/iqiyi/FASPell.
Background : Since the early 1990s, research on detecting and correcting Chinese spelling errors has progressed, but Chinese spell checking remains challenging. Unlike alphabetic languages, Chinese lacks explicit word delimiters and morphological cues, making direct transfer of English‑style methods ineffective.
Two major bottlenecks of previous models :
Over‑fitting on scarce Chinese spelling‑checking data. Existing datasets are small and expensive to create; performance plateaus after generating ~40k synthetic sentences.
Inflexibility and insufficient utilization of character similarity in fixed confusion sets. Fixed confusion sets cannot cover all contextual variations (e.g., simplified vs. traditional characters) and treat all similar characters equally, ignoring nuanced similarity scores.
Paper Overview : The authors propose a new paradigm for Chinese spell checking, named FASPell, which integrates a Denoising Auto‑Encoder (DAE) and a Decoder.
Key advantages over prior state‑of‑the‑art (SOTA) models:
Faster computation and simpler architecture.
Applicable to both simplified and traditional Chinese, and to texts generated by humans or machines.
Improved detection and correction performance.
The improvements stem from addressing the two bottlenecks:
DAE component : Leveraging unsupervised pre‑training (e.g., BERT, XLNet, MASS) reduces the amount of labeled Chinese spelling data needed to fewer than 10,000 sentences.
Decoder component : Replaces the static confusion set with a confidence‑character‑similarity decoder (CSD) that flexibly exploits fine‑grained character similarity.
Methodology :
The DAE is implemented using the masked language model (MLM) from BERT, generating a matrix of possible corrected candidates for each erroneous token. The CSD quantifies character similarity using Unicode IDS representations for glyph structure and incorporates phonetic information from all CJK languages (Mandarin, Cantonese, Japanese, Korean, Vietnamese). During training, the MLM‑generated matrix is plotted in a confidence‑vs‑similarity space to find a decision boundary that separates false positives (FP) from true positives (TP). At inference, the boundary filters out FPs, yielding the final corrected characters.
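The two CSD ideas above can be sketched briefly: glyph similarity as string similarity between IDS (Ideographic Description Sequence) decompositions, and candidate filtering against a decision boundary in confidence‑vs‑similarity space. This is a minimal illustration, not the paper's implementation: the IDS strings, weights, and threshold below are made up for the example, and FASPell actually fits its boundary empirically on held‑out data rather than using a fixed linear rule.

```python
from difflib import SequenceMatcher

# Hypothetical IDS decompositions for three characters; real IDS data comes
# from Unicode-derived databases and can nest recursively.
IDS = {
    "午": "⿱𠂉十",
    "牛": "⿻𠂉十",  # visually close to 午: same components, different layout
    "天": "⿱一大",
}

def glyph_similarity(a: str, b: str) -> float:
    """Glyph similarity as the normalized edit similarity of IDS strings."""
    return SequenceMatcher(None, IDS[a], IDS[b]).ratio()

def keep_candidate(confidence: float, similarity: float,
                   w_conf: float = 1.0, w_sim: float = 1.0,
                   threshold: float = 1.2) -> bool:
    """Accept an MLM-proposed substitution only if a weighted combination of
    model confidence and character similarity clears a linear boundary.
    The weights and threshold here are illustrative placeholders."""
    return w_conf * confidence + w_sim * similarity > threshold
```

On this toy data, 牛 scores as a closer glyph neighbor of 午 than 天 does, so a low‑confidence MLM candidate can still survive filtering when its similarity to the original character is high, which is exactly the flexibility a fixed confusion set lacks.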
Experiments and Results : The paper conducts ablation studies on four datasets, demonstrating the individual contributions of the fine‑tuned MLM and the CSD to overall performance. FASPell achieves SOTA accuracy across all benchmarks.
For full details, see the original paper.
iQIYI Technical Product Team