
Enhancing Speech Keyword Detection Using Prefix Automaton Beam Search

This article presents a method to improve keyword detection in large‑scale speech recognition by integrating a prefix automaton into the beam‑search decoding of seq2seq models, enabling real‑time addition of new terms while reducing computational overhead compared to traditional approaches.

Zuoyebang Tech Team

Background

With the rise of deep learning, large models trained on massive data are increasingly used in online tasks. However, these models often struggle with rare keywords in specific domains because they require extensive training data and long iteration cycles.

Seq2Seq Model Decoding

Seq2seq models, introduced in 2014, consist of an encoder and a decoder and are widely used for machine translation and speech recognition. This article focuses on the decoding phase.

During decoding, each time step depends on both the input sequence and the previously generated output. A naive approach that keeps only the highest‑confidence hypothesis at each step can lead to sub‑optimal results, while keeping all hypotheses leads to exponential growth. Beam search mitigates this by retaining the top‑N candidates at each step.

Example: given the pinyin sequence "y i d a l i zh i", beam search keeps only the top‑N candidate character sequences at each step, reducing the search space from exponential in the sequence length to linear, since each step expands at most N retained hypotheses.
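A minimal sketch of this pruning step (the per‑step token distributions and homophone candidates below are invented for illustration, standing in for the decoder's real output):

```python
import math

def beam_search(step_log_probs, beam_width=3):
    """Keep only the top-N hypotheses at each decoding step.

    step_log_probs: one dict per time step mapping token -> log-probability
    (a stand-in for the decoder's per-step output distribution).
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-prob)
    for dist in step_log_probs:
        candidates = []
        for seq, score in beams:
            for token, lp in dist.items():
                candidates.append((seq + (token,), score + lp))
        # Prune: retain only the top-N candidates instead of all of them.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Toy example: two steps of the pinyin "yi da" with homophone candidates.
steps = [
    {"意": math.log(0.5), "一": math.log(0.3), "以": math.log(0.2)},
    {"大": math.log(0.6), "打": math.log(0.4)},
]
print(beam_search(steps, beam_width=2)[0][0])  # ('意', '大')
```

With beam width N and vocabulary size V, each step scores at most N·V continuations, instead of the V^T full search space.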

Problems with Traditional Beam Search and Improvements

Traditional beam search sacrifices accuracy for speed and may miss rare keywords or newly added terms, which is problematic for keyword detection and risk‑control scenarios.

Retraining the model with more data is costly. A 2018 Google paper (Williams et al.) instead proposed adjusting beam‑search results by adding an auxiliary language model during decoding, improving the fit to niche corpora without retraining the acoustic model.

This shallow‑fusion approach adds a contextual module that re‑scores hypotheses, updating the top‑N candidates. While it improves recall, it increases computational cost because each candidate is evaluated by the language model multiple times.
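As a sketch of the shallow‑fusion rescoring idea (the interpolation weight, the hot‑word set, and the toy language model below are all illustrative assumptions, not the paper's actual model):

```python
import math

def rescore_beam(beams, lm_logprob, lam=0.3):
    """Re-rank beam hypotheses with an external language model (shallow fusion).

    beams: list of (token_sequence, acoustic_log_prob)
    lm_logprob: callable returning the LM log-probability of a sequence
    lam: interpolation weight for the LM score
    """
    fused = [(seq, am + lam * lm_logprob(seq)) for seq, am in beams]
    fused.sort(key=lambda c: c[1], reverse=True)
    return fused

# Toy contextual LM: leaves hot-word hypotheses unpenalized
# (equivalently, boosts them relative to everything else).
HOT_WORDS = {"意大利"}

def toy_lm(seq):
    text = "".join(seq)
    return 0.0 if any(w in text for w in HOT_WORDS) else math.log(0.5)

beams = [(("意", "大", "力"), math.log(0.4)),
         (("意", "大", "利"), math.log(0.35))]
best = rescore_beam(beams, toy_lm, lam=1.0)[0][0]
print("".join(best))  # 意大利 — the hot-word hypothesis wins after fusion
```

The extra cost is visible here: every hypothesis in the beam is scored by the language model at every rescoring pass, which is exactly the overhead the prefix‑automaton approach below aims to avoid.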

Prefix Automaton Weighted Decoding

A prefix automaton (or trie) efficiently handles multi‑pattern matching. By converting the dictionary into a prefix tree and constructing transition states, the decoder can quickly determine valid continuations without recomputing probabilities from the root at each step.

Integrating the prefix automaton with beam search allows the decoder to maintain both the beam candidates and the current automaton state, adding only constant overhead in time and memory.
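One way to sketch such an automaton is with Aho‑Corasick‑style failure links (the keyword set below is illustrative): `step()` advances exactly one state per emitted token, so a decoder could store one automaton state alongside each beam hypothesis instead of re‑matching from the root.

```python
from collections import deque

class PrefixAutomaton:
    """Trie with failure links so matching advances one state per token
    instead of restarting at the root after each mismatch."""

    def __init__(self, words):
        self.goto = [{}]       # state -> {char: next_state}
        self.fail = [0]        # failure links
        self.output = [False]  # True if a keyword ends at this state
        for w in words:
            self._insert(w)
        self._build_fail()

    def _insert(self, word):
        s = 0
        for ch in word:
            if ch not in self.goto[s]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append(False)
                self.goto[s][ch] = len(self.goto) - 1
            s = self.goto[s][ch]
        self.output[s] = True

    def _build_fail(self):
        # Breadth-first construction: a state's failure link points to the
        # longest proper suffix of its path that is also a trie prefix.
        q = deque(self.goto[0].values())
        while q:
            s = q.popleft()
            for ch, nxt in self.goto[s].items():
                q.append(nxt)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.output[nxt] = self.output[nxt] or self.output[self.fail[nxt]]

    def step(self, state, ch):
        """Advance one token; returns (next_state, keyword_matched)."""
        while state and ch not in self.goto[state]:
            state = self.fail[state]
        state = self.goto[state].get(ch, 0)
        return state, self.output[state]

aut = PrefixAutomaton(["意大利", "大力"])
state, hits = 0, []
for ch in "意大力出奇迹":
    state, matched = aut.step(state, ch)
    if matched:
        hits.append(ch)
print(hits)  # ['力'] — "大力" completes inside "意大力"
```

Because each hypothesis carries a single integer state, extending a beam candidate by one token costs amortized constant time, which is the constant overhead claimed above.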

Real‑Time Addition of New Words

To support real‑time hot‑word insertion, the system maintains two structures: a prefix automaton with transition states and a plain trie for newly added words. When the plain trie exceeds a threshold, it is merged into the automaton and transition states are rebuilt.

This design limits the need to rebuild the entire transition table for each new word, trading a modest increase in query cost for significantly reduced update latency.
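The two‑tier design can be sketched as follows; the class name, the merge threshold, and the use of plain sets in place of the real automaton and trie are all assumptions for illustration:

```python
class HotWordIndex:
    """Two-tier hot-word store: a compiled main structure (standing in for
    the prefix automaton with precomputed transition states) plus a small
    plain store for freshly added words. New words are queryable
    immediately; the expensive rebuild is amortized over many insertions."""

    MERGE_THRESHOLD = 3  # illustrative; tune against rebuild cost in practice

    def __init__(self, words=()):
        self.compiled = self._build(words)  # rebuilt only on merge
        self.pending = set()                # cheap to update, extra lookup cost

    def _build(self, words):
        # Placeholder for rebuilding the automaton's transition table.
        return frozenset(words)

    def add(self, word):
        self.pending.add(word)
        if len(self.pending) >= self.MERGE_THRESHOLD:
            # Merge: rebuild the compiled structure once, then clear pending.
            self.compiled = self._build(set(self.compiled) | self.pending)
            self.pending.clear()

    def contains(self, word):
        # Query both tiers: compiled structure first, then pending words.
        return word in self.compiled or word in self.pending

idx = HotWordIndex(["意大利"])
idx.add("大力")              # visible immediately, no rebuild triggered yet
print(idx.contains("大力"))  # True
```

Each query now touches two structures instead of one, which is the modest increase in query cost traded for not rebuilding the transition table on every insertion.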

Evaluation

In a speech‑recognition system, the prefix‑automaton weighted decoding was compared against standard beam search. Keyword recall improved by 4.6%, while the character error rate (CER) increased only slightly, an acceptable trade‑off.

Conclusion

The proposed method augments seq2seq decoding with a state‑transition automaton, enabling efficient handling of domain‑specific keywords and real‑time updates, thereby enhancing keyword detection in speech transcription without retraining the acoustic model.

References

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NIPS.

Zhao, D., et al. (2019). Shallow‑Fusion End‑to‑End Contextual Biasing. Interspeech.

Hori, T., et al. (2007). Efficient WFST‑based one‑pass decoding with on‑the‑fly hypothesis rescoring. IEEE Transactions on Audio, Speech, and Language Processing.

Williams, I., et al. (2018). Contextual Speech Recognition in End‑to‑End Neural Network Systems Using Beam Search. Interspeech.

Hori, T., & Nakamura, A. (2013). Speech recognition algorithms using weighted finite‑state transducers. Synthesis Lectures on Speech and Audio Processing.

Tags: Beam Search, Seq2Seq, speech recognition, real-time decoding, keyword detection, prefix automaton
Written by

Zuoyebang Tech Team

Sharing technical practices from Zuoyebang
