An Overview of Kaldi Chain Model Speech Recognition and Its Relationship with HMM‑DNN and Discriminative Training
This article explains the Kaldi chain model speech‑recognition system, covering HMM‑DNN fundamentals, discriminative (MMI) training, the special single‑state HMM topology, TDNN architecture, training pipelines, and experimental results that demonstrate its performance advantages over traditional GMM‑based approaches.
Introduction: The article introduces the chain model speech‑recognition system proposed in Kaldi, first reviewing HMM‑DNN speech‑recognition and discriminative training as prerequisite knowledge.
HMM‑DNN system: Speech recognition is formulated as finding the word sequence W that maximizes P(O|W)·P(W), where P(O|W) comes from the acoustic model and P(W) from the language model. The acoustic model uses an HMM topology for state transitions and a DNN to provide emission probabilities, replacing the traditional GMM.
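The decision rule above can be sketched numerically. This is a toy illustration with made-up scores, not a real decoder: each candidate word sequence gets a log acoustic score, log P(O|W), and a log language-model score, log P(W), and the recognizer picks the sequence maximizing their sum.

```python
import math

# Toy Bayes decision rule for ASR: pick W maximizing
# log P(O|W) + log P(W).  All scores are hypothetical.
candidates = {
    "recognize speech": {"acoustic": -120.0, "lm": -4.2},
    "wreck a nice beach": {"acoustic": -118.5, "lm": -9.7},
}

def total_score(scores):
    # log P(O|W) + log P(W) -- products become sums in the log domain
    return scores["acoustic"] + scores["lm"]

best = max(candidates, key=lambda w: total_score(candidates[w]))
```

Note that the second candidate wins on the acoustic score alone; the language-model term is what tips the decision toward the more plausible word sequence.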
State tying and triphone modeling: Triphone modeling (each phone conditioned on its left and right neighbors) dramatically increases the number of parameters, so state tying (pdf‑id sharing) is introduced to reduce data requirements. The DNN outputs the posterior P(S_t|O_t); Bayes’ rule converts it into the scaled likelihood P(O_t|S_t) needed for acoustic scoring.
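The posterior-to-likelihood conversion in hybrid systems is the standard step of dividing DNN posteriors by the state priors, usually in the log domain: log P(O_t|S_t) = log P(S_t|O_t) − log P(S_t) (up to a constant). A minimal sketch with made-up numbers:

```python
import numpy as np

# DNN posteriors P(S_t|O_t): one row per frame, one column per tied state.
# Values here are illustrative, not from a real model.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])

# State priors P(S_t), typically estimated from training alignments.
priors = np.array([0.5, 0.3, 0.2])

# Scaled log-likelihoods log P(O_t|S_t) used for acoustic scoring.
log_likelihoods = np.log(posteriors) - np.log(priors)
```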
Discriminative training: Sequence‑discriminative (MMI) training maximizes the ratio between the probability of the correct transcription and the sum of probabilities of all competing word sequences, unlike maximum‑likelihood training which only maximizes the correct path. Lattice approximation is used to compute the denominator efficiently.
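The MMI objective for one utterance is log of the numerator (the score of the reference transcription) minus log of the denominator (the sum over all competing hypotheses). A toy sketch with a tiny explicit competitor list standing in for the lattice; all path scores are hypothetical:

```python
import math

# Numerator: joint log-score of the reference transcription, log P(O, W_ref).
log_num = -10.0

# Denominator: in practice a lattice over all word sequences; here a tiny
# explicit list of competing path scores (which includes the reference).
log_competitors = [-10.0, -11.0, -13.0]

# log-sum-exp over the denominator, done stably by factoring out the max.
m = max(log_competitors)
log_den = m + math.log(sum(math.exp(x - m) for x in log_competitors))

# MMI objective for this utterance; training maximizes it, pushing the
# reference path's share of the total probability mass toward 1.
f_mmi = log_num - log_den
```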
Chain model: The chain model keeps the HMM‑DNN acoustic structure (typically a TDNN) and adopts MMI discriminative training without lattice approximation. It uses a special HMM topology in which each phone has a single emitting state, and any additional frames of the same phone are absorbed by a “blank”‑like state; this permits frame subsampling (evaluating the network on only one out of every three frames) and an acoustic‑model‑level bi‑phone language model for the denominator, which together reduce computational cost.
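The frame-subsampling trick is simple to picture: the network output is only evaluated at a reduced frame rate (a factor of 3 is the standard Kaldi chain setting). A minimal sketch on a dummy feature matrix:

```python
import numpy as np

# Standard setting in Kaldi chain recipes.
frame_subsampling_factor = 3

# Dummy feature matrix: 10 frames of 3-dimensional features.
feats = np.arange(30).reshape(10, 3)

# Keep only every third frame (frames 0, 3, 6, 9), cutting the number of
# network evaluations during training and decoding by roughly 3x.
subsampled = feats[::frame_subsampling_factor]
```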
Kaldi implementation: A step‑by‑step description of the Kaldi egs/aishell2 chain‑model recipe is provided, covering data preparation (dictionary conversion, speaker files, wav.scp, etc.), GMM‑HMM bootstrap, feature extraction (MFCC, i‑vector), lattice generation, TDNN definition, chunking of training data, training stages, and decoding graph construction (HCLG.fst).
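Of the data-preparation files mentioned above, wav.scp is the simplest: each line maps an utterance ID to an audio path (or a command producing audio). A small sketch of parsing it; the utterance IDs and paths below are hypothetical:

```python
from io import StringIO

# In-memory stand-in for a Kaldi wav.scp file; each line is
# "<utt_id> <path-or-command>".  IDs and paths are made up.
wav_scp = StringIO(
    "IC0001W0001 /data/aishell2/wav/IC0001W0001.wav\n"
    "IC0001W0002 /data/aishell2/wav/IC0001W0002.wav\n"
)

utt2wav = {}
for line in wav_scp:
    # Split only on the first whitespace: the remainder may itself
    # contain spaces when it is a pipeline command rather than a path.
    utt_id, path = line.strip().split(maxsplit=1)
    utt2wav[utt_id] = path
```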
Experimental results: Performance comparisons among HMM‑GMM, HMM‑TDNN (DNN replacing GMM), and chain‑model systems are presented on a mixed training set using two 2080 Ti GPUs. Results show that DNN replacement yields a large gain over GMM, and the chain‑model tricks provide further improvements.
Conclusion: The chain model is an important advancement within the HMM‑based speech‑recognition stream, combining classic HMM topology with modern neural‑network and discriminative‑training techniques, and understanding it is key to mastering HMM‑based ASR.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.