Enhancing Speech Keyword Detection Using Prefix Automaton Beam Search
This article presents a method to improve keyword detection in large‑scale speech recognition by integrating a prefix automaton into the beam‑search decoding of seq2seq models, enabling real‑time addition of new terms while reducing computational overhead compared to traditional approaches.
Background
With the rise of deep learning, large models trained on massive data are increasingly used in online tasks. However, these models often struggle with rare keywords in specific domains because they require extensive training data and long iteration cycles.
Seq2Seq Model Decoding
Seq2seq models, introduced in 2014, consist of an encoder and a decoder and are widely used for machine translation and speech recognition. This article focuses on the decoding phase.
During decoding, each time step depends on both the input sequence and the previously generated output. Greedy decoding, which keeps only the single highest‑confidence hypothesis at each step, can lead to sub‑optimal results, while keeping every hypothesis leads to exponential growth. Beam search mitigates this by retaining only the top‑N candidates at each step.
Example: given the pinyin sequence "y i d a l i zh i", beam search keeps only the top‑N candidates at each step, so the number of hypotheses explored grows linearly with the sequence length rather than exponentially.
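The top‑N pruning described above can be sketched as follows. This is a minimal illustration, not the production decoder: the per‑step token log‑probabilities are passed in as plain dicts standing in for the seq2seq decoder's softmax output.

```python
import math

def beam_search(step_scores, beam_width=3):
    """Generic beam search over per-step token log-probabilities.

    step_scores: one dict per decoding step, mapping token -> log-prob.
    Returns the surviving hypotheses as (tokens, score) pairs,
    best first.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-prob)
    for scores in step_scores:
        candidates = []
        for tokens, logp in beams:
            for tok, tok_logp in scores.items():
                candidates.append((tokens + (tok,), logp + tok_logp))
        # Prune: keep only the top-N candidates at each step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Toy example: two decoding steps, beam width 2.
steps = [
    {"意": math.log(0.6), "易": math.log(0.3), "一": math.log(0.1)},
    {"大": math.log(0.7), "达": math.log(0.3)},
]
print(beam_search(steps, beam_width=2))
```

Because at most N hypotheses survive each step, the work per step is bounded by N times the vocabulary size, which is what keeps the total cost linear in the sequence length.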
Problems with Traditional Beam Search and Improvements
Traditional beam search sacrifices accuracy for speed and may miss rare keywords or newly added terms, which is problematic for keyword detection and risk‑control scenarios.
Retraining the model with more data is costly. Google’s 2018 paper proposed adjusting beam‑search results by adding an auxiliary language model during decoding, improving fit for niche corpora without retraining the acoustic model.
This shallow‑fusion approach adds a contextual module that re‑scores hypotheses and updates the top‑N candidates. While it improves recall, it also increases computational cost, because the auxiliary language model must score every candidate at every decoding step.
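A minimal sketch of this re‑scoring step, under the usual shallow‑fusion formulation (acoustic score plus a weighted LM score). The `lm_logprob` callable and the toy keyword‑preferring LM are hypothetical stand‑ins for the auxiliary language model.

```python
def shallow_fusion_rescore(candidates, lm_logprob, lam=0.3):
    """Re-rank beam candidates by adding a weighted LM score.

    candidates: list of (tokens, acoustic_logprob) from the beam.
    lm_logprob: callable tokens -> log-probability under the
                auxiliary language model (hypothetical interface).
    lam:        interpolation weight for the LM score.
    """
    rescored = [
        (tokens, am + lam * lm_logprob(tokens))
        for tokens, am in candidates
    ]
    rescored.sort(key=lambda c: c[1], reverse=True)
    return rescored

# Toy LM that strongly prefers the in-domain keyword "意大利".
def toy_lm(tokens):
    return 0.0 if "".join(tokens).startswith("意大利") else -5.0

cands = [(("易", "大", "力"), -1.0), (("意", "大", "利"), -1.2)]
print(shallow_fusion_rescore(cands, toy_lm, lam=0.5))
```

Note that `lm_logprob` is invoked once per candidate per re‑scoring pass, which is exactly the overhead the article attributes to this approach.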
Prefix Automaton Weighted Decoding
A prefix automaton (or trie) efficiently handles multi‑pattern matching. By converting the dictionary into a prefix tree and constructing transition states, the decoder can quickly determine valid keyword continuations without re‑matching from the root at each step.
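The dictionary‑to‑trie conversion can be sketched as below. This is the plain prefix tree only; the class name and `step` interface are illustrative, and the full transition‑state construction is omitted.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class PrefixTrie:
    """Minimal prefix tree built from the keyword dictionary."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def step(self, node, ch):
        """Advance one character; returns the next state, or None
        if `ch` is not a valid continuation from `node`."""
        return node.children.get(ch)

trie = PrefixTrie(["意大利", "意大利面"])
state = trie.root
for ch in "意大利":
    state = trie.step(state, ch)
print(state.is_word)  # → True
```

Each `step` call is a single dictionary lookup, which is what makes matching incremental: the decoder never re‑scans the keyword from its first character.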
Integrating the prefix automaton with beam search allows the decoder to maintain both the beam candidates and the current automaton state, adding only constant overhead in time and memory.
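Combining the two, each beam hypothesis carries a trie state alongside its tokens and score. The sketch below uses a compact dict‑based trie and a simple additive bonus for tokens that stay on a keyword path; the exact weighting scheme is an assumption for illustration, not the system's actual scoring rule.

```python
import math

class Trie:
    """Compact prefix tree: children are nested dicts, "$" marks a word end."""

    def __init__(self, words=()):
        self.root = {}
        for w in words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True

    def step(self, node, ch):
        """Advance one character; None means `ch` leaves every keyword path."""
        return node.get(ch) if node is not None else None

def keyword_beam_search(step_scores, trie, bonus=1.0, beam_width=3):
    """Beam search where every hypothesis also carries its current
    trie state, so keyword continuations are boosted in O(1) per step."""
    beams = [((), 0.0, trie.root)]  # (tokens, cumulative score, trie state)
    for scores in step_scores:
        candidates = []
        for tokens, score, state in beams:
            for tok, logp in scores.items():
                nxt = trie.step(state, tok)
                # Reward tokens that stay on a dictionary keyword path;
                # otherwise restart matching from the root.
                gain = logp + (bonus if nxt is not None else 0.0)
                if nxt is None:
                    nxt = trie.step(trie.root, tok)
                candidates.append((tokens + (tok,), score + gain, nxt))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(tokens, score) for tokens, score, _ in beams]

# Without the bonus, "易大" would outrank the in-dictionary prefix "意大".
trie = Trie(["意大利"])
steps = [
    {"意": math.log(0.4), "易": math.log(0.6)},
    {"大": math.log(0.9), "小": math.log(0.1)},
]
print(keyword_beam_search(steps, trie, bonus=1.0, beam_width=2))
```

The only additions per hypothesis are one stored trie state and one dictionary lookup per expanded token, which is the constant overhead claimed above.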
Real‑Time Addition of New Words
To support real‑time hot‑word insertion, the system maintains two structures: a prefix automaton with transition states and a plain trie for newly added words. When the plain trie exceeds a threshold, it is merged into the automaton and transition states are rebuilt.
This design limits the need to rebuild the entire transition table for each new word, trading a modest increase in query cost for significantly reduced update latency.
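The two‑tier design can be sketched as follows. For brevity, sets stand in for the compiled automaton and the plain trie; the class name and threshold value are illustrative, and in the real system `_rebuild` would reconstruct the transition states.

```python
class HotwordDict:
    """Two-tier keyword store: a large compiled structure that is
    expensive to rebuild, plus a small overlay for words added at
    run time. When the overlay exceeds `merge_threshold` entries,
    it is folded into the main structure in one batched rebuild."""

    def __init__(self, words=(), merge_threshold=100):
        self.compiled = set(words)   # stand-in for the built automaton
        self.pending = set()         # newly added hot words
        self.merge_threshold = merge_threshold

    def add(self, word):
        self.pending.add(word)
        if len(self.pending) >= self.merge_threshold:
            self._rebuild()

    def _rebuild(self):
        # One batched rebuild instead of one per insertion.
        self.compiled |= self.pending
        self.pending.clear()

    def __contains__(self, word):
        # Queries consult both tiers: slightly slower lookups in
        # exchange for much lower update latency.
        return word in self.compiled or word in self.pending
```

A new word is queryable immediately after `add`, while the expensive rebuild happens at most once per `merge_threshold` insertions.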
Evaluation
In a speech‑recognition system, the prefix‑automaton weighted decoding was compared against standard beam search. Keyword recall improved by 4.6 %, while the character error rate (CER) increased slightly, an acceptable trade‑off.
Conclusion
The proposed method augments seq2seq decoding with a state‑transition automaton, enabling efficient handling of domain‑specific keywords and real‑time updates, thereby enhancing keyword detection in speech transcription without retraining the acoustic model.
References
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NIPS.
Zhao, D., et al. (2019). Shallow‑Fusion End‑to‑End Contextual Biasing. Interspeech.
Hori, T., et al. (2007). Efficient WFST‑based one‑pass decoding with on‑the‑fly hypothesis rescoring. IEEE Transactions on Audio, Speech, and Language Processing.
Williams, I., et al. (2018). Contextual Speech Recognition in End‑to‑End Neural Network Systems Using Beam Search. Interspeech.
Hori, T., & Nakamura, A. (2013). Speech recognition algorithms using weighted finite‑state transducers. Synthesis Lectures on Speech and Audio Processing.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang