cw2vec: Learning Chinese Word Embeddings with Stroke n-grams
Presented at AAAI 2018, the cw2vec paper introduces a Chinese word embedding method that leverages stroke n‑grams to capture character‑level semantics. It proposes a dedicated loss function, demonstrates consistent improvements over existing models on word similarity, word analogy, text classification, and named‑entity recognition tasks, and discusses real‑world AI applications.
Word‑vector algorithms are fundamental to natural language processing, but most existing methods, such as word2vec, are designed for Latin‑script languages and ignore the rich semantic information inherent in Chinese characters. The cw2vec model, a collaboration between Ant Financial AI Lab and Singapore University of Technology and Design, addresses this gap by representing Chinese words through n‑gram sequences of strokes.
The authors define “stroke n‑grams” as contiguous sequences of n strokes within a character, treating each n‑gram as a semantic unit. A new loss function is introduced that combines a sigmoid‑based similarity term with negative sampling, allowing efficient training without the computational burden of a full softmax.
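The scoring and training idea can be sketched in a few lines: the similarity between a word and a context word is the sum of dot products between each of the word's stroke n‑gram vectors and the context vector, and the loss pulls true contexts up while pushing sampled negatives down. This is a minimal illustrative sketch, not the paper's implementation; the function names and toy vectors are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def similarity(ngram_vecs, context_vec):
    """Score a (word, context) pair as the sum of dot products
    between each stroke n-gram vector of the word and the
    context word's vector."""
    return sum(dot(q, context_vec) for q in ngram_vecs)

def pair_loss(ngram_vecs, context_vec, negative_vecs):
    """Negative log-likelihood for one training pair under
    negative sampling: maximize the sigmoid similarity with the
    true context, minimize it with sampled negative contexts."""
    loss = -math.log(sigmoid(similarity(ngram_vecs, context_vec)))
    for neg in negative_vecs:
        loss += -math.log(sigmoid(-similarity(ngram_vecs, neg)))
    return loss
```

Because the sigmoid only needs the true context and a handful of sampled negatives, each update touches a few vectors rather than the whole vocabulary, which is what makes training tractable compared with a full softmax.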
During preprocessing, each word is decomposed into its constituent characters, each character is split into strokes, strokes are mapped to numeric IDs, and sliding windows generate the stroke n‑grams. Each n‑gram receives its own embedding vector, initialized randomly with the same dimensionality as traditional word vectors.
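The preprocessing steps above can be sketched as follows. The stroke‑to‑ID table follows the paper's five stroke classes (horizontal, vertical, left‑falling, right‑falling, turning), but the example decompositions and the n‑gram range used here are illustrative assumptions, not the paper's exact data.

```python
# Hypothetical stroke decompositions using five stroke-class IDs:
# 1 horizontal, 2 vertical, 3 left-falling, 4 right-falling, 5 turning.
# The entries below are assumed for illustration only.
STROKES = {
    "大": [1, 3, 4],
    "人": [3, 4],
}

def stroke_ngrams(word, n_min=3, n_max=12):
    """Decompose a word into characters, concatenate their stroke
    IDs into one sequence, then slide windows of every length from
    n_min to n_max over it to produce stroke n-grams."""
    strokes = [s for ch in word for s in STROKES[ch]]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(strokes) - n + 1):
            grams.append(tuple(strokes[i:i + n]))
    return grams
```

For the toy word "大人" the concatenated stroke sequence has five IDs, so the sliding windows yield six n‑grams in total; each distinct n‑gram would then be assigned its own randomly initialized embedding vector.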
Experimental evaluation on public datasets compares cw2vec with word2vec (skip‑gram and CBOW), GloVe, CWE, and recent pixel‑ and radical‑based Chinese embedding methods. Across word similarity, word analogy, text classification, and named‑entity recognition tasks, cw2vec consistently outperforms the baselines. Additional experiments varying embedding dimensionality and using only 20% of Chinese Wikipedia as training data further confirm its robustness, especially on small corpora.
Case studies on the terms “water pollution” and “Sun Wukong” illustrate cw2vec’s ability to capture fine‑grained semantic relations that other models miss, thanks to the combined influence of stroke information and contextual word vectors.
Beyond research, the cw2vec technique has been deployed in Ant Group’s intelligent customer service, text risk control, and recommendation systems, and similar approaches have been explored for Japanese and Korean, resulting in nearly twenty related patent applications.
The paper can be accessed at https://github.com/ShelsonCao/cw2vec/blob/master/cw2vec.pdf.