
How DeepXi and MHANet Revolutionize Speech Enhancement with Multi‑Head Attention

DeepXi is a two‑stage deep learning framework for speech enhancement that estimates the prior SNR and applies an MMSE gain, while the MHANet extension uses multi‑head attention to model long‑range dependencies. This article covers the training strategy, compression to a GRU‑based model, TFLite deployment, and the resulting low‑latency performance.

Douyu Streaming

Background

Speech enhancement algorithms aim to improve perceived quality and intelligibility of noisy speech by suppressing background noise without distorting the speech.

Currently, deep learning methods are at the forefront, with deep neural networks (DNNs) used to map noisy speech magnitude spectra to clean spectra or noisy time‑domain frames to clean frames.

DeepXi Framework

DeepXi is a deep‑learning method for prior SNR estimation.

DeepXi architecture

DeepXi consists of two stages:

Stage 1: Input noisy speech magnitude spectrum; a DNN estimates a mapped prior SNR, scaled to the [0,1] interval to accelerate SGD convergence.

Stage 2: The mapped prior SNR is used to compute an MMSE‑approximate gain function, which multiplies the noisy magnitude spectrum to obtain an estimate of the clean magnitude spectrum.
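The two stages above can be sketched in a few lines. This is a simplified illustration, not DeepXi's exact implementation: the Gaussian mapping statistics (`mu_db`, `sigma_db`) and the square‑root Wiener gain are stand‑ins for the per‑bin CDF statistics and MMSE gain functions the framework actually uses.

```python
import numpy as np
from statistics import NormalDist

def unmap_prior_snr(xi_bar, mu_db=0.0, sigma_db=10.0):
    """Invert the [0,1] mapping (assumed Gaussian CDF over dB-scale SNR)
    to recover the prior SNR on a linear scale."""
    nd = NormalDist(mu_db, sigma_db)
    xi_db = np.array([nd.inv_cdf(float(np.clip(p, 1e-6, 1 - 1e-6)))
                      for p in np.ravel(xi_bar)]).reshape(np.shape(xi_bar))
    return 10.0 ** (xi_db / 10.0)

def mmse_approx_gain(xi):
    """Square-root Wiener gain: one simple MMSE-style gain from the prior SNR."""
    return np.sqrt(xi / (1.0 + xi))

# Stage 2: the gain multiplies the noisy magnitude spectrum, bin by bin.
noisy_mag = np.array([1.0, 2.0, 0.5])   # |X| for three frequency bins
xi_bar = np.array([0.9, 0.5, 0.1])      # mapped prior SNR from the stage-1 DNN
clean_est = mmse_approx_gain(unmap_prior_snr(xi_bar)) * noisy_mag
```

Because the gain lies in (0, 1), bins with a high estimated prior SNR pass nearly unchanged while low‑SNR bins are attenuated.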

MHANet

Within the DeepXi framework, the DNN can be any architecture such as RNN or TCN.

Multi‑head attention (MHA) outperforms RNNs and TCNs in tasks like machine translation by modeling long‑range dependencies via sequence similarity. DeepXi‑MHANet incorporates MHA to model the long‑term dependencies of noisy speech effectively.

DeepXi-MHANet architecture

MHANet details:

Input noisy speech magnitude |X|, add positional encoding, first layer projects to d_model, then B blocks output mapped prior SNR. Each block contains an MHA module, a two‑layer feed‑forward network, residual connections, and frame‑wise normalization.

MHA module: queries Q, keys K, values V; output is weighted sum of V with attention weights computed from Q‑K similarity. Each head uses masked scaled dot‑product attention; dimensions satisfy d_k = d_v = d_model / H.

Masked scaled dot‑product attention computes similarity via scaled dot product, optionally applies a mask, then softmax, and multiplies by V_h to produce attention‑enhanced values.
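The masked scaled dot‑product attention described above can be sketched as follows. The sizes `T`, `d_model`, and `H` are illustrative, and the learned Q/K/V projection matrices are omitted for brevity (each head here simply takes its own slice of the input); the causal mask keeps each frame from attending to future frames.

```python
import numpy as np

def masked_scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (T, d_k); V: (T, d_v). Returns the attention-weighted values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled Q-K similarity
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked (future) frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

T, d_model, H = 4, 8, 2
d_k = d_model // H                              # d_k = d_v = d_model / H
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
causal = np.tril(np.ones((T, T), dtype=bool))   # frame t attends to frames <= t

# One head per d_k-sized slice; concatenating the heads restores d_model.
heads = [masked_scaled_dot_product_attention(
             X[:, h * d_k:(h + 1) * d_k],
             X[:, h * d_k:(h + 1) * d_k],
             X[:, h * d_k:(h + 1) * d_k],
             mask=causal)
         for h in range(H)]
out = np.concatenate(heads, axis=-1)            # (T, d_model)
```

Note that with the causal mask, the first frame can only attend to itself, so its output equals its own value vector.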

Training Strategy

Loss: cross‑entropy between the estimated and target mapped prior SNR (both lie in [0, 1]).

Mini‑batch size = 10, 200 training iterations.

Each mini‑batch mixes clean speech with randomly selected noise at random start points and SNRs ranging from –10 to 20 dB in 1 dB steps.

Clean speech selection is random each iteration.
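The mixing step above can be sketched as follows. The signal lengths and the uniform random draws are illustrative assumptions; the key point is scaling the noise segment so the mixture hits the sampled target SNR.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng):
    """Mix clean speech with a randomly positioned noise segment at a target SNR (dB)."""
    start = rng.integers(0, len(noise) - len(speech) + 1)  # random start point
    seg = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(seg ** 2)
    # Scale the noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * seg

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)          # stand-in for a clean utterance
noise = rng.normal(size=48000)           # stand-in for a noise recording
snr_db = rng.integers(-10, 21)           # -10 .. 20 dB in 1 dB steps
noisy = mix_at_snr(speech, noise, float(snr_db), rng)
```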

Optimizer: Adam with β₁=0.9, β₂=0.98, ε=10⁻⁹.

Learning rate α follows a warm‑up schedule: linearly increases until warm‑up steps ψ, then decays proportionally to the inverse square root of training steps.
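This warm‑up schedule matches the Transformer‑style "Noam" schedule; a minimal sketch follows, with `d_model` and `warmup_steps` as illustrative values rather than the article's actual settings.

```python
def warmup_lr(step, d_model=256, warmup_steps=40000):
    """Linear warm-up to `warmup_steps`, then decay proportional to the
    inverse square root of the training step."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

During warm‑up the rate grows linearly with the step; the two branches of the `min` meet exactly at `warmup_steps`, after which the inverse‑square‑root branch takes over.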

Learning rate schedule

Improvement Strategies

Because the Transformer model is large and unsuitable for edge deployment, a GRU‑based architecture was adopted.

Model Optimization

Replace Transformer with GRU, reducing parameters from 4.8 M to 0.3 M.

Limit model size to under 1 MB.

Apply data augmentation such as reverberation.
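To see why a GRU stack lands near the quoted 0.3 M parameters, here is a rough count using the single‑bias GRU formulation; the layer sizes (a 257‑bin input and 128 hidden units) are hypothetical, chosen only to show that a small stack reaches this order of magnitude.

```python
def gru_params(d_in, d_h):
    """Trainable parameters of one GRU layer: three gates, each with an
    input weight (d_in*d_h), a recurrent weight (d_h*d_h), and a bias."""
    return 3 * (d_in * d_h + d_h * d_h + d_h)

def dense_params(d_in, d_out):
    """Fully connected output layer: weights plus bias."""
    return d_in * d_out + d_out

# Hypothetical compact stack: 257-bin input, two GRU layers, output layer.
n = gru_params(257, 128) + gru_params(128, 128) + dense_params(128, 257)
# n comes out to 280,065 -- roughly 0.28 M, in line with the ~0.3 M figure.
```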

Deployment Solutions

Address data continuity issues caused by segment‑wise processing.

Adopt TFLite as the deployment framework.

Efficiently implement algorithmic operators like inverse error function and integration.
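As one example of implementing such an operator without a scientific‑computing dependency, the inverse error function can be built from the standard library's `math.erf` via Newton's method. This is a sketch of the idea, not the article's actual deployment code, and a production version would want a better initial guess for arguments near ±1.

```python
import math

def erfinv(y, tol=1e-12):
    """Inverse error function via Newton's method on math.erf."""
    if not -1.0 < y < 1.0:
        raise ValueError("erfinv domain is (-1, 1)")
    x = 0.0
    for _ in range(60):
        err = math.erf(x) - y
        if abs(err) < tol:
            break
        # d/dx erf(x) = 2/sqrt(pi) * exp(-x^2)
        x -= err / (2.0 / math.sqrt(math.pi) * math.exp(-x * x))
    return x
```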

Results

The full deployment pipeline produces an algorithm library meeting client requirements.

Parameters: 300 k

Memory: ≤10 MB

Latency: 16 ms

Library size: ≤2 MB

Effect Demonstration

The algorithm combines traditional methods with deep learning to denoise while preserving speech quality. Sample denoising results include keyboard noise and wind/road noise.

Keyboard noise result
Keyboard noise result 2
Wind and road noise result
Wind and road noise result 2
Tags: deep learning, GRU, multi-head attention, speech enhancement, audio denoising, TFLite
Written by

Douyu Streaming

Official account of Douyu Streaming Development Department, sharing audio and video technology best practices.
