ASR Error Correction with BERT, ELECTRA and a Fuzzy‑Phoneme Generator: Techniques from Xiaomi AI
This article describes how Xiaomi's AI team tackles Automatic Speech Recognition (ASR) query errors: it analyzes common error patterns, applies BERT, ELECTRA and a soft-masked BERT model, generates synthetic noisy training data with a fuzzy-phoneme generator, and closes with experimental results and future research directions.
The talk introduces the problem of ASR query errors in Xiaomi's voice assistant, where inaccurate transcription of user speech leads to downstream NLU failures; correcting these errors is essential for reliable voice interaction.
It reviews typical ASR mistakes, such as homophonic substitutions and mixed‑language queries, and discusses why standard spelling correction methods are insufficient for phonetic errors.
Related work includes the BERT model (masked language modeling) and ELECTRA (generator‑discriminator pre‑training), highlighting their strengths and limitations for error detection and correction.
The authors propose a soft‑masked BERT correction model that combines a BiGRU error detector with BERT, using a weighted average of token and [MASK] embeddings based on error probabilities, achieving higher correction accuracy than vanilla BERT.
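The soft-masking step described above can be written as a one-line blend per token: e_i' = p_i · e_mask + (1 − p_i) · e_i, where p_i is the detector's error probability for token i. A minimal NumPy sketch of just that step (function name and toy shapes are illustrative, not from the talk):

```python
import numpy as np

def soft_mask(token_embs, mask_emb, error_probs):
    """Blend each token embedding with the [MASK] embedding, weighted
    by the detector's per-token error probability:
        e_i' = p_i * e_mask + (1 - p_i) * e_i
    A high error probability pushes the input toward [MASK], letting
    BERT re-predict the token; a low one keeps the original embedding.
    """
    p = error_probs[:, None]               # (seq_len, 1) for broadcasting
    return p * mask_emb + (1.0 - p) * token_embs

# Toy example: 3 tokens with 4-dim embeddings.
tokens = np.ones((3, 4))                   # all-ones token embeddings
mask = np.zeros(4)                         # [MASK] embedding
probs = np.array([0.0, 1.0, 0.5])          # detector output per token
blended = soft_mask(tokens, mask, probs)
```

Token 0 keeps its embedding, token 1 is fully masked, and token 2 sits halfway between the two, which is exactly the soft interpolation that distinguishes this model from hard-masking BERT.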
To train the system, they build a large corpus (≈100 M sentences) from Wikipedia, Zhihu, news, and user logs, and design a fuzzy‑phoneme generator that creates synthetic ASR errors by replacing characters according to phonetic similarity levels (five grades of fuzziness) using a non‑standard pinyin scheme.
The correction pipeline consists of a generator that produces erroneous sentences and a discriminator that predicts the correct tokens; the discriminator leverages both character embeddings from BERT and phoneme embeddings, concatenated before a softmax layer.
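The discriminator's fusion step (character embedding plus phoneme embedding, concatenated, then projected through a softmax) reduces to a few lines of linear algebra. A NumPy sketch with made-up dimensions, standing in for the BERT and phoneme encoders:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discriminator_probs(char_embs, phone_embs, W, b):
    """Concatenate per-token character and phoneme embeddings, then
    project to a distribution over the vocabulary.
    Hypothetical shapes: char_embs (seq, d_c), phone_embs (seq, d_p),
    W (d_c + d_p, vocab), b (vocab,)."""
    h = np.concatenate([char_embs, phone_embs], axis=-1)  # (seq, d_c+d_p)
    return softmax(h @ W + b)                             # (seq, vocab)

# Toy forward pass with random stand-ins for the two encoders.
rng = np.random.default_rng(0)
seq, d_c, d_p, vocab = 5, 8, 4, 10
probs = discriminator_probs(rng.normal(size=(seq, d_c)),
                            rng.normal(size=(seq, d_p)),
                            rng.normal(size=(d_c + d_p, vocab)),
                            np.zeros(vocab))
```

Concatenation (rather than summation) keeps the phonetic evidence in its own subspace, so the projection can weight "sounds like" and "reads like" signals independently when predicting the correct token.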
Experimental results show that adding vocabulary filtering and recursive prediction raises F1 from 9.3% to 21.6%; training on the synthetic correction data brings it to 65%; additional tricks and the phoneme features push the final F1 to 77.6%.
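The two inference-time tricks mentioned above compose naturally: run the corrector repeatedly until its output stops changing, and at each pass reject any predicted character that falls outside a trusted vocabulary. A hedged sketch, where `correct_once` is a hypothetical stand-in for the model's single-pass, length-preserving prediction:

```python
def correct_recursively(sentence, correct_once, vocab, max_iters=5):
    """Apply a one-shot corrector until a fixed point is reached.
    `correct_once`: hypothetical single-pass model prediction that
    returns a string of the same length as its input.
    `vocab`: set of trusted characters; out-of-vocabulary predictions
    are discarded and the original character is kept (vocabulary
    filtering)."""
    for _ in range(max_iters):
        candidate = correct_once(sentence)
        filtered = "".join(c if c in vocab else o
                           for c, o in zip(candidate, sentence))
        if filtered == sentence:        # fixed point: no further edits
            return sentence
        sentence = filtered
    return sentence                     # give up after max_iters passes

# Toy corrector that fixes one 'x' per pass, mimicking a model that
# only confidently corrects a single error at a time.
fix_one = lambda s: s.replace("x", "y", 1)
result = correct_recursively("xxz", fix_one, set("xyz"))
```

Recursive prediction lets a conservative model that fixes one error per pass still repair sentences with several errors, while the vocabulary filter prevents it from drifting into implausible outputs across iterations.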
Future work includes exploring seq2seq BERT‑decoder models, incorporating contextual attention for long‑tail domain knowledge, and developing mechanisms for rapid model updates to handle emerging hot‑topic queries.
Sohu Tech Products