Weighted Finite State Transducers (WFST) in Traditional Speech Recognition: Principles and Optimization
This article explains the role of Weighted Finite State Transducers in conventional HMM‑based speech recognition, covering language models, pronunciation dictionaries, WFST definitions, semiring theory, composition and determinization operations, decoding graph construction (HCLG), lattice rescoring, and practical optimization techniques for real‑world scenarios.
In traditional HMM‑based Automatic Speech Recognition (ASR) systems, Weighted Finite State Transducers (WFSTs) are crucial for integrating acoustic models, pronunciation lexicons, and language models into a unified decoding graph.
The article first reviews the background of ASR, describing the pipeline of acoustic modeling, lexicon, language model, and decoder, and introduces the challenges of mapping HMM states to word sequences.
It then details language model fundamentals, including n‑gram formulation, smoothing techniques (interpolation, back‑off, Kneser‑Ney), perplexity evaluation, and tools such as SRILM, IRSTLM, and rnnlm, emphasizing their impact on decoding performance.
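The n‑gram smoothing and perplexity ideas above can be sketched directly. Below is a minimal, self‑contained illustration (not SRILM or IRSTLM code; the toy corpus and the interpolation weight are illustrative) of a bigram model with linear interpolation smoothing and its perplexity:

```python
import math
from collections import Counter

# Toy corpus (illustrative only)
corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(w):
    return unigrams[w] / N

def p_bigram_mle(w_prev, w):
    # Maximum-likelihood bigram estimate; zero for unseen history
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def p_interpolated(w_prev, w, lam=0.7):
    # Linear interpolation smoothing:
    # P(w | w_prev) = lam * P_bigram(w | w_prev) + (1 - lam) * P_unigram(w)
    return lam * p_bigram_mle(w_prev, w) + (1 - lam) * p_unigram(w)

def perplexity(words, lam=0.7):
    # PPL = exp(-(1/M) * sum of log P over the M bigram events)
    log_sum = sum(math.log(p_interpolated(a, b, lam))
                  for a, b in zip(words, words[1:]))
    return math.exp(-log_sum / (len(words) - 1))
```

A lower perplexity on held‑out text indicates a better‑fitting model, which is the evaluation criterion mentioned above; Kneser‑Ney smoothing replaces the simple interpolation weights with discounted, continuation‑based estimates.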
Next, the definition of a WFST is presented, distinguishing it from related finite‑state machines (FSA, FST, WFSA) and illustrating weighted transitions (e.g., an arc a:z/1.2 consumes input symbol a, emits output symbol z, and carries weight 1.2). The role of semirings (Probability and Tropical) is explained, showing how the choice of semiring determines path scoring and shortest‑path search in decoding.
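The two semirings differ only in how path weights are extended and how alternative paths are combined. A minimal sketch (illustrative, not OpenFst's API): in the Probability semiring, alternatives are summed and arc weights multiplied; in the Tropical semiring, weights are negative log probabilities, arcs add along a path, and the cheapest path wins (Viterbi):

```python
import math

class ProbabilitySemiring:
    # (plus, times, zero, one) = (+, *, 0, 1)
    zero, one = 0.0, 1.0
    @staticmethod
    def plus(a, b):   # combine alternative paths
        return a + b
    @staticmethod
    def times(a, b):  # extend a path by one arc
        return a * b

class TropicalSemiring:
    # (plus, times, zero, one) = (min, +, inf, 0); weights are -log probs
    zero, one = math.inf, 0.0
    @staticmethod
    def plus(a, b):   # keep the cheaper path (Viterbi)
        return min(a, b)
    @staticmethod
    def times(a, b):  # accumulate cost along a path
        return a + b

def path_weight(sr, arcs):
    w = sr.one
    for a in arcs:
        w = sr.times(w, a)
    return w

def combine(sr, weights):
    w = sr.zero
    for p in weights:
        w = sr.plus(w, p)
    return w

# Two alternative paths with per-arc probabilities 0.5*0.4 and 0.2
prob_paths = [[0.5, 0.4], [0.2]]
total_prob = combine(ProbabilitySemiring,
                     [path_weight(ProbabilitySemiring, p) for p in prob_paths])
cost_paths = [[-math.log(w) for w in p] for p in prob_paths]
best_cost = combine(TropicalSemiring,
                    [path_weight(TropicalSemiring, p) for p in cost_paths])
```

Decoders typically work in the Tropical semiring because min/+ over log costs is numerically stable and maps directly onto shortest‑path search.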
The core WFST operations are described: composition (combining graphs at matching input/output symbols) and optimization steps such as determinization, weight pushing, epsilon removal, and minimization, with examples of iterative state‑pair processing.
Using these operations, the construction of the HCLG decoding graph is explained (H: HMM topology, C: context‑dependent phone mapping, L: lexicon, G: language model), and the alternating sequence of composition and optimization that yields an efficient decoder.
The concept of lattices is introduced as pruned decoding graphs that enable rescoring, lattice‑based discriminative training, and fast adaptation to new language models without full recomposition. Techniques such as "subtract‑add" rescoring and lattice‑based beam pruning are discussed.
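The "subtract‑add" idea reduces to simple arithmetic on path costs: for each lattice path, subtract the old LM cost and add the new LM cost, leaving acoustic scores untouched. A minimal sketch (all lattice entries, costs, and the toy new LM are illustrative; costs are negative log probabilities):

```python
def rescore_path(acoustic_cost, old_lm_cost, new_lm_cost, lm_scale=1.0):
    # Subtract the old LM contribution, add the new one; acoustic part unchanged
    return acoustic_cost - lm_scale * old_lm_cost + lm_scale * new_lm_cost

# Hypothetical lattice entries: (words, acoustic_cost, old_lm_cost)
lattice = [
    (["turn", "on", "the", "light"], 12.0, 4.0),
    (["turn", "on", "the", "lite"],  11.5, 6.5),
]

# Costs assigned by a new, domain-adapted LM (illustrative numbers)
new_lm_cost = {"turn on the light": 3.0, "turn on the lite": 8.0}

rescored = [(w, rescore_path(ac, lm, new_lm_cost[" ".join(w)]))
            for (w, ac, lm) in lattice]
best = min(rescored, key=lambda t: t[1])
# "turn on the light": 12.0 - 4.0 + 3.0 = 11.0 beats 11.5 - 6.5 + 8.0 = 13.0
```

Because only LM costs change, the lattice built with the old graph can be reused as-is, which is why rescoring avoids full recomposition of the decoding graph.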
Finally, practical scenarios are covered, including language‑model adaptation to specific domains, lattice rescoring, hot‑word integration, and model fusion via interpolation, highlighting how these methods improve recognition accuracy while keeping acoustic models unchanged.
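Model fusion via interpolation amounts to mixing two LMs' conditional distributions, P(w|h) = λ·P_domain(w|h) + (1−λ)·P_general(w|h). A minimal sketch with illustrative distributions and mixing weight:

```python
def interpolate(p_domain, p_general, lam=0.5):
    """Linearly interpolate two probability distributions over the same vocabulary."""
    vocab = set(p_domain) | set(p_general)
    return {w: lam * p_domain.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
            for w in vocab}

# Illustrative next-word distributions for some fixed history
p_general = {"rent": 0.1, "buy": 0.3, "sell": 0.6}
p_domain  = {"rent": 0.7, "buy": 0.2, "sell": 0.1}  # e.g. a rental-domain LM

p_mix = interpolate(p_domain, p_general, lam=0.8)
# Domain-salient "rent" is boosted: 0.8*0.7 + 0.2*0.1 = 0.58
```

Since both inputs are proper distributions, the mixture remains one, so the fused LM can be recompiled into G (or used for rescoring) while the acoustic model stays unchanged.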
The article concludes that optimizing WFST‑based decoding graphs remains a key advantage of traditional ASR pipelines over end‑to‑end approaches, especially when adapting to new business cases or fixing bad cases.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.