
cw2vec: Learning Chinese Word Embeddings with Stroke n-grams

The cw2vec paper, presented at AAAI 2018, introduces a Chinese word embedding method that leverages stroke n‑grams to capture character-level semantics. It proposes a dedicated loss function, demonstrates consistent improvements over existing models on word similarity, word analogy, text classification, and named-entity recognition tasks, and discusses real-world AI applications.


Word‑vector algorithms are fundamental to natural language processing, but most existing methods, such as word2vec, are designed for Latin‑script languages and ignore the rich semantic information inherent in Chinese characters. The cw2vec model, a collaboration between Ant Financial AI Lab and Singapore University of Technology and Design, addresses this gap by representing Chinese words through n‑gram sequences of strokes.

The authors define “stroke n‑grams” as contiguous sequences of n strokes drawn from a word’s concatenated stroke sequence, treating each n‑gram as a basic semantic unit. A new loss function combines a sigmoid-based similarity term with negative sampling, allowing efficient training without the computational burden of a full softmax.
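The shape of such an objective can be sketched in NumPy. This is an illustrative reading of the loss described above, not the authors' implementation: the word is represented by the sum of its stroke n‑gram vectors, the positive context is pushed toward it via a log-sigmoid term, and sampled negatives are pushed away. All variable names here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(ngram_vecs, context_vec, negative_vecs):
    """Negative-sampling loss for one (word, context) pair.

    ngram_vecs:    (k, d) array - embeddings of the word's k stroke n-grams
    context_vec:   (d,)  array  - embedding of the true context word
    negative_vecs: (m, d) array - embeddings of m sampled negative words
    """
    # Word-context similarity: the word vector is the sum of its
    # stroke n-gram vectors, compared to the context by dot product.
    word_vec = ngram_vecs.sum(axis=0)
    pos = np.log(sigmoid(word_vec @ context_vec))
    neg = np.log(sigmoid(-negative_vecs @ word_vec)).sum()
    return -(pos + neg)  # minimize the negative log-likelihood
```

Because the sigmoid only scores one true context and a handful of negatives per pair, each training step costs O(m·d) instead of a softmax over the full vocabulary.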

During preprocessing, each word is decomposed into its constituent characters, each character is split into strokes, strokes are mapped to numeric IDs, and sliding windows generate the stroke n‑grams. Each n‑gram receives its own embedding vector, initialized randomly with the same dimensionality as traditional word vectors.
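The preprocessing steps above can be sketched as follows. The stroke table covers only the two characters of the example word 大人 (“adult”), and the five-way stroke-to-ID mapping mirrors the kind of scheme the paper describes; both are illustrative toy data, not the paper's actual resources.

```python
# Toy stroke dictionary: each character maps to its stroke sequence.
STROKES = {
    "大": ["横", "撇", "捺"],  # horizontal, left-falling, right-falling
    "人": ["撇", "捺"],
}

# Map each stroke type to a numeric ID (assumed five-category scheme).
STROKE_IDS = {"横": 1, "竖": 2, "撇": 3, "捺": 4, "折": 5}

def stroke_ngrams(word, n_min=3, n_max=5):
    """Decompose a word into characters, concatenate their stroke IDs,
    and slide windows of size n_min..n_max over the ID sequence."""
    ids = [STROKE_IDS[s] for ch in word for s in STROKES[ch]]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(ids) - n + 1):
            grams.append(tuple(ids[i:i + n]))
    return grams

# 大人 -> stroke IDs [1, 3, 4, 3, 4] -> six n-grams of sizes 3..5
print(stroke_ngrams("大人"))
```

Each distinct n‑gram tuple would then index its own randomly initialized embedding vector of the same dimensionality as the word vectors.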

Experimental evaluation on public datasets compares cw2vec with word2vec (skip‑gram and CBOW), GloVe, CWE, and recent pixel‑ and radical‑based Chinese embedding methods. Across word similarity, word analogy, text classification, and named‑entity recognition tasks, cw2vec consistently outperforms the baselines. Additional experiments varying embedding dimensionality and using only 20% of Chinese Wikipedia as training data further confirm its robustness, especially on small corpora.

Case studies on the terms “water pollution” and “Sun Wukong” illustrate cw2vec’s ability to capture fine‑grained semantic relations that other models miss, thanks to the combined influence of stroke information and contextual word vectors.

Beyond research, the cw2vec technique has been deployed in Ant Group’s intelligent customer service, text risk control, and recommendation systems, and similar approaches have been explored for Japanese and Korean, resulting in nearly twenty related patent applications.

The paper can be accessed at https://github.com/ShelsonCao/cw2vec/blob/master/cw2vec.pdf.

Tags: AI research, Chinese NLP, word embeddings, AAAI 2018, stroke n-grams
Written by AntTech

Technology is the core driver of Ant's future creation.