
GPU Hotword Enhancement for WeNet End-to-End Speech Recognition

This article explains the design, implementation, and experimental evaluation of hot-word enhancement in WeNet's GPU runtime, detailing how character-based and word-based language-model scoring are extended to boost recognition of rare proper nouns in both streaming and non-streaming ASR services.

58 Tech

End‑to‑end speech recognition systems achieve good accuracy with large training data, but they often misrecognize rare proper nouns such as personal names, product names, or place names; a fast remedy is hot‑word enhancement.

WeNet already supports hot-words on its CPU runtime for both the CTC Prefix Beam Search and WFST Beam Search decoders. The GPU runtime now adds hot-word support to the existing ctc_decoder, enabling both streaming and non-streaming services, and the implementation has been open-sourced to the WeNet community.

The GPU hot-word implementation builds on the language-model scoring inside ctc_decoder. The decoder can use either a character-based or a word-based language model. For Chinese, spaces are removed and a character-based N-gram model (e.g., order 4) computes probabilities; for English, a word-based model does the same. The hot-word scorer reuses the make_ngram function to extract N-grams from the current prefix, checks them against a user-provided hot-word dictionary, and adds a configurable weight to the log-probability when a match is found.
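The scoring path can be sketched as follows. Here make_ngram mirrors the function named in the article, while score_prefix, the default weight value, and the span-scanning details are illustrative assumptions rather than the actual C++ code in ctc_decoder:

```python
# Hypothetical sketch of hot-word scoring layered on top of the N-gram
# language-model score. Names are illustrative, not the real C++ symbols.

def make_ngram(prefix, order=4):
    """Return the last `order` tokens of the current decoding prefix."""
    return prefix[-order:]

def score_prefix(prefix, lm_logprob, hotwords, hotword_weight=3.0, order=4):
    """Add a configurable hot-word weight to the LM log-probability
    when a span inside the N-gram window matches the dictionary."""
    ngram = make_ngram(prefix, order)
    # Scan candidate spans inside the window; stop at the first match.
    for i in range(len(ngram)):
        for j in range(i + 1, len(ngram) + 1):
            candidate = "".join(ngram[i:j])
            if candidate in hotwords:
                return lm_logprob + hotword_weight
    return lm_logprob
```

With a hot-word dictionary containing "语音", a prefix ending in those characters receives the bonus on top of its language-model score; prefixes with no match are left unchanged.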

2.1 Output-character hot-word enhancement

The character-level case is illustrated with a four-frame example (characters "语", "音", "识", "别"). The process consists of three steps: (1) obtain the characters in a fixed-size window using make_ngram [4], (2) combine adjacent characters into candidate words with std::accumulate, and (3) add hot-word scores for any candidate present in the dictionary. Only the first matching hot-word in a window is scored.
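The three steps above can be sketched for the example window, with functools.reduce standing in for std::accumulate; the function names and the minimum word length of two characters are assumptions for illustration:

```python
# Minimal sketch of character-window hot-word matching for the window
# ["语", "音", "识", "别"]. Names and defaults are illustrative.
from functools import reduce

def candidate_words(window):
    """Step 2: combine adjacent characters into candidate words
    (spans of length >= 2 inside the window)."""
    cands = []
    for i in range(len(window)):
        for j in range(i + 2, len(window) + 1):
            # reduce plays the role of std::accumulate in the C++ code.
            cands.append(reduce(lambda a, b: a + b, window[i:j]))
    return cands

def hotword_bonus(window, hotwords, weight=3.0):
    """Step 3: score only the first matching hot-word in the window."""
    for cand in candidate_words(window):
        if cand in hotwords:
            return weight
    return 0.0

window = ["语", "音", "识", "别"]
# candidate_words(window) contains "语音", "语音识", "语音识别",
# "音识", "音识别", and "识别".
```

Returning on the first match implements the rule that only one hot-word per window is scored.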

2.2 Output-word hot-word enhancement

This variant applies when the acoustic model outputs spaces. After a space is emitted, the decoder looks back over the previous window (e.g., four tokens) to form words, checks each word against the hot-word list, and adds the corresponding weight.
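A hedged sketch of the word-level variant, assuming tokens are output symbols that include a space token; the function name and the exact lookback slicing are illustrative:

```python
# Illustrative sketch: apply word-level hot-word scoring only when the
# most recent emitted token is a space, then look back over the window.
SPACE = " "

def word_hotword_bonus(tokens, hotwords, window_length=4, weight=3.0):
    """Add `weight` for each word in the lookback window that appears
    in the hot-word list; do nothing unless a space was just emitted."""
    if not tokens or tokens[-1] != SPACE:
        return 0.0
    # Tokens before the space, limited to the lookback window.
    lookback = tokens[-(window_length + 1):-1]
    words = "".join(lookback).split(SPACE)
    return sum(weight for w in words if w in hotwords)
```

For example, after the sequence "w e n e t ␣" the lookback joins to "wenet", which is scored if it is in the hot-word list; sequences not ending in a space are never scored.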

3.1 Using hot-words in the WeNet GPU runtime

Enabling hot-words requires editing scoring/1/model.py and wenet/1/wenet_onnx_model.py. The steps are: (1) create a hotwords.yaml file and reference its path in config.pbtxt; (2) the runtime loads the dictionary automatically if the path exists; (3) initialize HotWordsScorer with the dictionary, the vocabulary, the window length (default 4), SPACE_ID, and is_character_based; (4) pass the scorer to ctc_beam_search_decoder_batch. Configuration examples and command-line parameters (e.g., beam_size=10, space_id=45, window_length=4) are provided.
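The wiring of steps (1) through (4) might look roughly like the fragment below. HotWordsScorer, ctc_beam_search_decoder_batch, and the constructor arguments are named in the article, but the module name, exact signatures, and argument order are assumptions and may differ from the actual swig bindings:

```python
# Illustrative wiring only; signatures are assumed, not verified against
# the ctc_decoder swig bindings.
import os
import yaml  # PyYAML

# Assumed module name, based on the ctc_decoder project layout.
from swig_decoders import HotWordsScorer, ctc_beam_search_decoder_batch

hotwords_path = "hotwords.yaml"   # step 1: path referenced in config.pbtxt
hotwords = None
if os.path.exists(hotwords_path):  # step 2: load only if the file exists
    with open(hotwords_path, encoding="utf-8") as f:
        hotwords = yaml.safe_load(f)  # e.g. {"语音识别": 3.0}

# Step 3: initialize the scorer (argument order is illustrative).
hotwords_scorer = None
if hotwords is not None:
    hotwords_scorer = HotWordsScorer(
        hotwords, vocabulary,
        window_length=4,          # default window size from the article
        SPACE_ID=45,              # space token id from the example config
        is_character_based=True,  # character-based scoring for Chinese
    )

# Step 4: pass the scorer into batched beam search.
results = ctc_beam_search_decoder_batch(
    batch_log_probs, batch_lens, vocabulary,
    beam_size=10, hotwords_scorer=hotwords_scorer)
```

Passing hotwords_scorer=None would fall back to plain language-model scoring, which matches the behavior described when no hotwords.yaml is configured.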

Experiments on the AISHELL‑1 test set and a dedicated hot‑word sub‑test set show that adding hot‑words improves overall accuracy with negligible impact on decoding speed. Detailed results and the hot‑word files are available in the referenced repositories.

Future work includes adding a trie‑based hot‑word implementation, combining neural language‑model scoring with N‑gram scoring, and supporting timestamp alignment.

The article concludes with author bios (Yang Jiao and Zhou Wei) and a brief introduction to the 58.com AI Lab.

References
[1] https://github.com/wenet-e2e/wenet/pull/1860
[2] https://github.com/Slyne/ctc_decoder
[3] ctc_decoder/swig/scorer.cpp#LL176C34-L176C44
[4] ctc_decoder/swig/hotwords.cpp#L43
[5] ctc_decoder/swig/hotwords.cpp#LL91C9-L91C9
[6] ctc_decoder/swig/scorer.cpp#L166
[7] https://www.openslr.org/33/
[8] https://www.modelscope.cn/datasets/speech_asr/speech_asr_aishell1_hotwords_testsets
[9] https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords

Tags: GPU, Speech Recognition, language model, ASR, CTC decoder, hotword, wenet
Written by 58 Tech, the official tech channel of 58.com and a platform for tech innovation, sharing, and communication.