
Spelling Correction System for 58.com Search Engine: Rule‑Based and Statistical Methods

This article describes the design and implementation of a spelling‑correction module for 58.com’s search engine, covering common query errors, rule‑based and statistical language‑model approaches, offline dictionary generation, n‑gram and Viterbi decoding, online workflow, and practical examples.

58 Tech

In search engines, users expect high‑quality results related to their query terms, but low‑quality or erroneous queries often lead to poor recall or no results. The spelling‑correction module in 58.com’s main site search addresses this by correcting queries for categories such as real estate, recruitment, classifieds, and used cars, thereby recovering lost traffic and improving user experience.

Common Query Errors

Typical errors include pure pinyin input, misspelled or missing letters in English, and mixed pinyin‑Chinese strings. Errors arise from regional pronunciation differences (e.g., zh→z, ang→an, fei→hui), small mobile screens causing mis‑selection, and input methods like handwriting or voice.

Correction Schemes

Rule‑Based Method

The offline process builds two indexes using the full vocabulary of 58.com’s services and search‑log hot terms: a pinyin index and an edit‑distance index. The pinyin index handles full pinyin, abbreviated pinyin, incomplete pinyin, homophonic and fuzzy‑sound errors; the edit‑distance index handles character‑level misspellings and deletions. During online correction, the query (or its processed form) is looked up in these indexes; if a match is found, the query is corrected accordingly.
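The two lookups described above can be sketched as follows. This is an illustrative sketch only: the sample terms, the hard-coded pinyin table, and the single-insertion/single-deletion strategy are assumptions, not the production implementation.

```python
from collections import defaultdict

TERMS = ["二手电脑", "水岸华庭"]
# In production the pinyin would come from a character-to-pinyin table;
# here it is hard-coded for the two sample terms.
PINYIN = {"二手电脑": "ershoudiannao", "水岸华庭": "shuianhuating"}

def build_pinyin_index(terms):
    """Map each term's full pinyin back to the term."""
    index = defaultdict(list)
    for t in terms:
        index[PINYIN[t]].append(t)
    return index

def build_edit_index(terms):
    """Map every single-letter deletion of a term's pinyin to (term, position)."""
    index = defaultdict(list)
    for t in terms:
        py = PINYIN[t]
        for i in range(len(py)):
            index[py[:i] + py[i + 1:]].append((t, i))
    return index

pinyin_index = build_pinyin_index(TERMS)
edit_index = build_edit_index(TERMS)

def correct(query):
    if query in pinyin_index:                # exact pinyin hit
        return pinyin_index[query][0]
    for i in range(len(query)):              # query has one extra letter:
        cand = query[:i] + query[i + 1:]     # try deleting each position
        if cand in pinyin_index:
            return pinyin_index[cand][0]
    hits = edit_index.get(query)             # query is missing one letter
    return hits[0][0] if hits else None
```

With these toy entries, `correct("ershoudianao")` (a missing "n") and `correct("ershouddiannao")` (an extra "d") both resolve to 二手电脑.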

Statistical Language‑Model Method

For long queries, an n‑gram language model (3‑gram) is trained with SRILM on the same corpus. The model evaluates the probability of candidate word sequences, and the Viterbi algorithm selects the highest‑probability path. The workflow includes tokenizing the query, generating candidate corrections via pinyin or edit‑distance methods, segmenting candidates, and scoring them with the n‑gram model.
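A 3-gram model of this kind is typically trained with SRILM roughly as below; the file names are placeholders and the smoothing choice is an assumption, since the article does not specify one.

```shell
# Train a trigram LM with interpolated Kneser-Ney smoothing from a
# segmented corpus (one whitespace-tokenized query/title per line):
ngram-count -order 3 -text segmented_corpus.txt \
            -lm query.lm -kndiscount -interpolate

# Sanity-check perplexity on held-out queries:
ngram -order 3 -lm query.lm -ppl heldout.txt
```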

Offline Dictionary Generation

Two dictionaries are created:

Pinyin Dictionary: generates four types of pinyin-based keys for each term: full pinyin, initials only, fuzzy-sound variants, and full pinyin with only the last character's initial.

Edit-Distance Dictionary: stores the substrings obtained by deleting each character (or each pinyin letter) from a term, together with the deletion position, to support character-level corrections.

Example entries illustrate how variations such as "ershoudiannao" (二手电脑) map to multiple misspelled forms.
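The four key types can be illustrated for 二手电脑; the per-character pinyin syllables and the fuzzy-sound rule table below are hard-coded assumptions for this one example.

```python
term = "二手电脑"
syllables = ["er", "shou", "dian", "nao"]   # per-character pinyin (hard-coded)

# 1. Full pinyin
full_pinyin = "".join(syllables)                      # "ershoudiannao"

# 2. Initials only
initials = "".join(s[0] for s in syllables)           # "esdn"

# 3. Fuzzy-sound variant, e.g. the regional zh/z and ang/an mergers
FUZZY = {"zh": "z", "ch": "c", "sh": "s", "ang": "an", "eng": "en", "ing": "in"}
def fuzzify(s):
    for k, v in FUZZY.items():
        s = s.replace(k, v)
    return s
fuzzy = "".join(fuzzify(s) for s in syllables)        # "ersoudiannao"

# 4. Full pinyin for all but the last character, plus the last character's
#    initial (matches abbreviated queries like "linshiG" for 临时工)
last_initial = "".join(syllables[:-1]) + syllables[-1][0]   # "ershoudiann"
```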

n‑gram Model

The 3-gram model assumes each word depends only on the two preceding words, drastically reducing computation. Each candidate sequence is scored as the product of P(word_i | word_{i-2}, word_{i-1}) over its words; in practice the log-probabilities are summed to avoid underflow. The model is used to rank candidate corrections generated from segmented queries.
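A toy version of this scoring is sketched below. The two-sentence corpus, the vocabulary size, and the add-one smoothing are all illustrative assumptions (the real system would use a large log corpus and proper smoothing such as Kneser-Ney).

```python
import math
from collections import defaultdict

# Tiny hypothetical corpus, padded with sentence-boundary tokens.
corpus = [["<s>", "<s>", "复式", "办公", "公寓", "</s>"],
          ["<s>", "<s>", "复式", "公寓", "出租", "</s>"]]

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sent in corpus:
    for i in range(2, len(sent)):
        trigram_counts[tuple(sent[i - 2:i + 1])] += 1
        bigram_counts[tuple(sent[i - 2:i])] += 1

VOCAB = 6  # hypothetical vocabulary size for add-one smoothing

def logprob(sentence):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}) over the padded sentence."""
    s = ["<s>", "<s>"] + sentence + ["</s>"]
    total = 0.0
    for i in range(2, len(s)):
        tri = trigram_counts[tuple(s[i - 2:i + 1])]
        bi = bigram_counts[tuple(s[i - 2:i])]
        total += math.log((tri + 1) / (bi + VOCAB))  # add-one smoothing
    return total

score_right = logprob(["复式", "办公", "公寓"])
score_wrong = logprob(["复试", "办公", "公寓"])
```

Because 复式 appears in the toy corpus and 复试 does not, the correct candidate receives the higher score, which is exactly how the model ranks corrections.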

Viterbi Decoding

In the 2-gram case, the best path ending at each candidate depends only on the best paths at the previous position. In the 3-gram case, each candidate must consider the candidates at the two previous positions, so multiple partial paths and their probabilities are stored per position. The final correction is the path with the maximum total probability.
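A minimal Viterbi pass over a candidate lattice looks like the sketch below, shown for the simpler 2-gram case. The candidate columns and bigram log-probabilities are made-up illustrative values, not numbers from the real model.

```python
# candidates[i] = correction candidates for query position i
candidates = [["复试", "复式"], ["办公"], ["公寓"]]

# Hypothetical bigram log-probabilities; unseen pairs get a floor score.
LOGP = {("<s>", "复式"): -1.0, ("<s>", "复试"): -1.2,
        ("复式", "办公"): -0.5, ("复试", "办公"): -3.0,
        ("办公", "公寓"): -0.4}
FLOOR = -10.0

def viterbi(candidates):
    # best[w] = (score of the best path ending in w, that path)
    best = {"<s>": (0.0, ["<s>"])}
    for column in candidates:
        nxt = {}
        for w in column:
            # Keep only the highest-scoring predecessor for each candidate.
            score, path = max(
                (prev_score + LOGP.get((p, w), FLOOR), prev_path + [w])
                for p, (prev_score, prev_path) in best.items()
            )
            nxt[w] = (score, path)
        best = nxt
    score, path = max(best.values())
    return path[1:], score          # drop the <s> padding

path, score = viterbi(candidates)   # selects 复式 办公 公寓
```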

Online Correction Flow

The online process proceeds as follows:

1. Parse the query to determine whether it is pure pinyin, mixed, or Chinese.

2. If pure pinyin, apply pinyin correction; if corrected, return the result.

3. If not corrected, apply edit-distance correction; if corrected, return the result.

4. If still uncorrected, perform long-pinyin correction and return any result.

5. For queries containing Chinese characters, first apply pinyin correction on the whole string; if unsuccessful, segment the query and apply the language-model method.
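The dispatch logic above can be summarized in code. The four correction functions are hypothetical stubs standing in for the real modules; only `pinyin_correct` is given a single toy index entry so the flow does something visible.

```python
import re

HAS_CHINESE = re.compile(r"[\u4e00-\u9fff]")

def pinyin_correct(q):
    table = {"shuianhuating": "水岸华庭"}   # toy stand-in for the pinyin index
    return table.get(q)

def edit_correct(q):        return None    # stub: edit-distance lookup
def long_pinyin_correct(q): return None    # stub: segmented long-pinyin
def ngram_correct(q):       return None    # stub: segment + language model

def correct_query(query):
    if not HAS_CHINESE.search(query):
        # Pure pinyin/letters: try the three rule-based methods in order.
        for method in (pinyin_correct, edit_correct, long_pinyin_correct):
            result = method(query)
            if result:
                return result
        return None
    # Contains Chinese: whole-string pinyin correction first,
    # then fall back to the segmented language-model path.
    return pinyin_correct(query) or ngram_correct(query)
```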

Examples of Correction Scenarios

| Erroneous Query | Corrected Query | Method | Level |
| --- | --- | --- | --- |
| Shuianhuating | 水岸华庭 | Pinyin correction (no Chinese characters) | Forced correction without prompt |
| ATLS | 奥特莱斯 | Pinyin correction (no Chinese characters) | Forced correction with prompt |
| linshiG | 临时工 | Pinyin correction (no Chinese characters) | Forced correction with prompt |
| iphoni4 | iphone4 | Edit-distance correction after failed pinyin correction | Forced correction with prompt |
| shijingshanxiaochaoshi | 石景山小超市 | Long-pinyin correction after failed pinyin and edit-distance correction | Forced correction with prompt |
| 途an | 途安 | Pinyin correction (contains Chinese) | Forced correction with prompt |
| 天津医科大学总医院zufang | 天津医科大学总医院租房 | Mixed pinyin + n-gram correction | Forced correction without prompt |
| 复试办公公寓 | 复式办公公寓 | n-gram correction after failed pinyin correction | Suggested correction with prompt |

Conclusion

The module demonstrates how rule‑based and statistical methods can correct most user input errors in the 58.com search scenario, though there remains room for improvement in dictionary construction, candidate generation, and evaluation. Emerging techniques such as CRF‑based error detection, deep‑wide hybrid models, and semantic‑based correction (e.g., Tencent’s approach) represent future directions.

Written by 58 Tech, the official tech channel of 58 and a platform for tech innovation, sharing, and communication.