ASR Error Correction with BERT, ELECTRA and a Fuzzy‑Phoneme Generator: Techniques from Xiaomi AI
This article describes how Xiaomi's AI team tackles Automatic Speech Recognition (ASR) query errors: it analyzes common error patterns, applies BERT, ELECTRA and a soft-masked BERT model, generates synthetic noisy training data with a fuzzy-phoneme generator, and closes with experimental results and future research directions.
The talk introduces the problem of ASR query errors in Xiaomi's voice assistant, where inaccurate transcription of user speech leads to downstream NLU failures; correcting these errors is essential for reliable voice interaction.
It reviews typical ASR mistakes, such as homophonic substitutions and mixed‑language queries, and discusses why standard spelling correction methods are insufficient for phonetic errors.
Related work includes the BERT model (masked language modeling) and ELECTRA (generator‑discriminator pre‑training), highlighting their strengths and limitations for error detection and correction.
The authors propose a soft‑masked BERT correction model that combines a BiGRU error detector with BERT, using a weighted average of token and [MASK] embeddings based on error probabilities, achieving higher correction accuracy than vanilla BERT.
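The soft-masking step described above can be written as a one-line blend per token: e_i' = p_i · e_mask + (1 − p_i) · e_i, where p_i is the detector's error probability for token i. A minimal NumPy sketch of just that step (function name and toy shapes are illustrative, not from the talk):

```python
import numpy as np

def soft_mask(token_embs, mask_emb, error_probs):
    """Blend each token embedding with the [MASK] embedding, weighted
    by the detector's per-token error probability:
        e_i' = p_i * e_mask + (1 - p_i) * e_i
    A high error probability pushes the input toward [MASK], letting
    BERT re-predict the token; a low one keeps the original embedding.
    """
    p = error_probs[:, None]               # (seq_len, 1) for broadcasting
    return p * mask_emb + (1.0 - p) * token_embs

# Toy example: 3 tokens with 4-dim embeddings.
tokens = np.ones((3, 4))                   # all-ones token embeddings
mask = np.zeros(4)                         # [MASK] embedding
probs = np.array([0.0, 1.0, 0.5])          # detector output per token
blended = soft_mask(tokens, mask, probs)
```

Token 0 keeps its embedding, token 1 is fully masked, and token 2 sits halfway between the two, which is exactly the soft interpolation that distinguishes this model from hard-masking BERT.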
To train the system, they build a large corpus (≈100 M sentences) from Wikipedia, Zhihu, news, and user logs, and design a fuzzy‑phoneme generator that creates synthetic ASR errors by replacing characters according to phonetic similarity levels (five grades of fuzziness) using a non‑standard pinyin scheme.
The correction pipeline consists of a generator that produces erroneous sentences and a discriminator that predicts the correct tokens; the discriminator leverages both character embeddings from BERT and phoneme embeddings, concatenated before a softmax layer.
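The discriminator's fusion step (character embedding plus phoneme embedding, concatenated, then projected through a softmax) reduces to a few lines of linear algebra. A NumPy sketch with made-up dimensions, standing in for the BERT and phoneme encoders:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discriminator_probs(char_embs, phone_embs, W, b):
    """Concatenate per-token character and phoneme embeddings, then
    project to a distribution over the vocabulary.
    Hypothetical shapes: char_embs (seq, d_c), phone_embs (seq, d_p),
    W (d_c + d_p, vocab), b (vocab,)."""
    h = np.concatenate([char_embs, phone_embs], axis=-1)  # (seq, d_c+d_p)
    return softmax(h @ W + b)                             # (seq, vocab)

# Toy forward pass with random stand-ins for the two encoders.
rng = np.random.default_rng(0)
seq, d_c, d_p, vocab = 5, 8, 4, 10
probs = discriminator_probs(rng.normal(size=(seq, d_c)),
                            rng.normal(size=(seq, d_p)),
                            rng.normal(size=(d_c + d_p, vocab)),
                            np.zeros(vocab))
```

Concatenation (rather than summation) keeps the phonetic evidence in its own subspace, so the projection can weight "sounds like" and "reads like" signals independently when predicting the correct token.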
Experimental results show that adding vocabulary filtering and recursive prediction raises F1 from 9.3% to 21.6%; training on the synthetic correction data brings it to 65%; additional tricks and the phoneme features push the final F1 to 77.6%.
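The two inference-time tricks mentioned above compose naturally: run the corrector repeatedly until its output stops changing, and at each pass reject any predicted character that falls outside a trusted vocabulary. A hedged sketch, where `correct_once` is a hypothetical stand-in for the model's single-pass, length-preserving prediction:

```python
def correct_recursively(sentence, correct_once, vocab, max_iters=5):
    """Apply a one-shot corrector until a fixed point is reached.
    `correct_once`: hypothetical single-pass model prediction that
    returns a string of the same length as its input.
    `vocab`: set of trusted characters; out-of-vocabulary predictions
    are discarded and the original character is kept (vocabulary
    filtering)."""
    for _ in range(max_iters):
        candidate = correct_once(sentence)
        filtered = "".join(c if c in vocab else o
                           for c, o in zip(candidate, sentence))
        if filtered == sentence:        # fixed point: no further edits
            return sentence
        sentence = filtered
    return sentence                     # give up after max_iters passes

# Toy corrector that fixes one 'x' per pass, mimicking a model that
# only confidently corrects a single error at a time.
fix_one = lambda s: s.replace("x", "y", 1)
result = correct_recursively("xxz", fix_one, set("xyz"))
```

Recursive prediction lets a conservative model that fixes one error per pass still repair sentences with several errors, while the vocabulary filter prevents it from drifting into implausible outputs across iterations.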
Future work includes exploring seq2seq BERT‑decoder models, incorporating contextual attention for long‑tail domain knowledge, and developing mechanisms for rapid model updates to handle emerging hot‑topic queries.
Sohu Tech Products