
How DeepXi and MHANet Revolutionize Speech Enhancement with Multi‑Head Attention

DeepXi is a two‑stage deep learning framework for speech enhancement that estimates the prior SNR and applies an MMSE gain, while the MHANet extension uses multi‑head attention to model long‑range dependencies. This article covers the training strategy, compression to a GRU‑based model, TFLite deployment, and the resulting low‑latency performance.

Douyu Streaming

Background

Speech enhancement algorithms aim to improve perceived quality and intelligibility of noisy speech by suppressing background noise without distorting the speech.

Currently, deep learning methods are at the forefront, with deep neural networks (DNNs) used to map noisy speech magnitude spectra to clean spectra or noisy time‑domain frames to clean frames.

DeepXi Framework

DeepXi is a deep‑learning method for prior SNR estimation.

DeepXi architecture

DeepXi consists of two stages:

Stage 1: Input noisy speech magnitude spectrum; a DNN estimates a mapped prior SNR, scaled to the [0,1] interval to accelerate SGD convergence.

Stage 2: The mapped prior SNR is used to compute an MMSE‑approximate gain function, which multiplies the noisy magnitude spectrum to obtain an estimate of the clean magnitude spectrum.
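The two stages above can be sketched in a few lines. This is a simplified illustration, not DeepXi's exact implementation: the Gaussian mapping statistics (`mu_db`, `sigma_db`) and the square‑root Wiener gain are stand‑ins for the per‑bin CDF statistics and MMSE gain functions the framework actually uses.

```python
import numpy as np
from statistics import NormalDist

def unmap_prior_snr(xi_bar, mu_db=0.0, sigma_db=10.0):
    """Invert the [0,1] mapping (assumed Gaussian CDF over dB-scale SNR)
    to recover the prior SNR on a linear scale."""
    nd = NormalDist(mu_db, sigma_db)
    xi_db = np.array([nd.inv_cdf(float(np.clip(p, 1e-6, 1 - 1e-6)))
                      for p in np.ravel(xi_bar)]).reshape(np.shape(xi_bar))
    return 10.0 ** (xi_db / 10.0)

def mmse_approx_gain(xi):
    """Square-root Wiener gain: one simple MMSE-style gain from the prior SNR."""
    return np.sqrt(xi / (1.0 + xi))

# Stage 2: the gain multiplies the noisy magnitude spectrum, bin by bin.
noisy_mag = np.array([1.0, 2.0, 0.5])   # |X| for three frequency bins
xi_bar = np.array([0.9, 0.5, 0.1])      # mapped prior SNR from the stage-1 DNN
clean_est = mmse_approx_gain(unmap_prior_snr(xi_bar)) * noisy_mag
```

Because the gain lies in (0, 1), bins with a high estimated prior SNR pass nearly unchanged while low‑SNR bins are attenuated.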

MHANet

Within the DeepXi framework, the DNN can be any architecture such as RNN or TCN.

Multi‑head attention (MHA) outperforms RNNs and TCNs in tasks like machine translation by modeling long‑range dependencies via sequence similarity. DeepXi‑MHANet incorporates MHA to model the long‑term dependencies of noisy speech effectively.

DeepXi-MHANet architecture

MHANet details:

Input noisy speech magnitude |X|, add positional encoding, first layer projects to d_model, then B blocks output mapped prior SNR. Each block contains an MHA module, a two‑layer feed‑forward network, residual connections, and frame‑wise normalization.

MHA module: queries Q, keys K, values V; output is weighted sum of V with attention weights computed from Q‑K similarity. Each head uses masked scaled dot‑product attention; dimensions satisfy d_k = d_v = d_model / H.

Masked scaled dot‑product attention computes similarity via scaled dot product, optionally applies a mask, then softmax, and multiplies by V_h to produce attention‑enhanced values.
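The masked scaled dot‑product attention described above can be sketched as follows. The sizes `T`, `d_model`, and `H` are illustrative, and the learned Q/K/V projection matrices are omitted for brevity (each head here simply takes its own slice of the input); the causal mask keeps each frame from attending to future frames.

```python
import numpy as np

def masked_scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (T, d_k); V: (T, d_v). Returns the attention-weighted values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled Q-K similarity
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked (future) frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

T, d_model, H = 4, 8, 2
d_k = d_model // H                              # d_k = d_v = d_model / H
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
causal = np.tril(np.ones((T, T), dtype=bool))   # frame t attends to frames <= t

# One head per d_k-sized slice; concatenating the heads restores d_model.
heads = [masked_scaled_dot_product_attention(
             X[:, h * d_k:(h + 1) * d_k],
             X[:, h * d_k:(h + 1) * d_k],
             X[:, h * d_k:(h + 1) * d_k],
             mask=causal)
         for h in range(H)]
out = np.concatenate(heads, axis=-1)            # (T, d_model)
```

Note that with the causal mask, the first frame can only attend to itself, so its output equals its own value vector.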

Training Strategy

Loss: cross‑entropy between the estimated and target mapped prior SNR (both lie in [0, 1]).

Mini‑batch size = 10, 200 training iterations.

Each mini‑batch mixes clean speech with randomly selected noise at random start points and SNRs ranging from –10 to 20 dB in 1 dB steps.

Clean speech selection is random each iteration.
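The mixing step above can be sketched as follows. The signal lengths and the uniform random draws are illustrative assumptions; the key point is scaling the noise segment so the mixture hits the sampled target SNR.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng):
    """Mix clean speech with a randomly positioned noise segment at a target SNR (dB)."""
    start = rng.integers(0, len(noise) - len(speech) + 1)  # random start point
    seg = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(seg ** 2)
    # Scale the noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * seg

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)          # stand-in for a clean utterance
noise = rng.normal(size=48000)           # stand-in for a noise recording
snr_db = rng.integers(-10, 21)           # -10 .. 20 dB in 1 dB steps
noisy = mix_at_snr(speech, noise, float(snr_db), rng)
```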

Optimizer: Adam with β₁=0.9, β₂=0.98, ε=10⁻⁹.

Learning rate α follows a warm‑up schedule: linearly increases until warm‑up steps ψ, then decays proportionally to the inverse square root of training steps.
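This warm‑up schedule matches the Transformer‑style "Noam" schedule; a minimal sketch follows, with `d_model` and `warmup_steps` as illustrative values rather than the article's actual settings.

```python
def warmup_lr(step, d_model=256, warmup_steps=40000):
    """Linear warm-up to `warmup_steps`, then decay proportional to the
    inverse square root of the training step."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

During warm‑up the rate grows linearly with the step; the two branches of the `min` meet exactly at `warmup_steps`, after which the inverse‑square‑root branch takes over.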

Learning rate schedule

Improvement Strategies

Because the Transformer model is large and unsuitable for edge deployment, a GRU‑based architecture was adopted.

Model Optimization

Replace Transformer with GRU, reducing parameters from 4.8 M to 0.3 M.

Limit model size to under 1 MB.

Apply data augmentation such as reverberation.
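To see why a GRU stack lands near the quoted 0.3 M parameters, here is a rough count using the single‑bias GRU formulation; the layer sizes (a 257‑bin input and 128 hidden units) are hypothetical, chosen only to show that a small stack reaches this order of magnitude.

```python
def gru_params(d_in, d_h):
    """Trainable parameters of one GRU layer: three gates, each with an
    input weight (d_in*d_h), a recurrent weight (d_h*d_h), and a bias."""
    return 3 * (d_in * d_h + d_h * d_h + d_h)

def dense_params(d_in, d_out):
    """Fully connected output layer: weights plus bias."""
    return d_in * d_out + d_out

# Hypothetical compact stack: 257-bin input, two GRU layers, output layer.
n = gru_params(257, 128) + gru_params(128, 128) + dense_params(128, 257)
# n comes out to 280,065 -- roughly 0.28 M, in line with the ~0.3 M figure.
```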

Deployment Solutions

Address data continuity issues caused by segment‑wise processing.

Adopt TFLite as the deployment framework.

Efficiently implement algorithmic operators like inverse error function and integration.
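As one example of implementing such an operator without a scientific‑computing dependency, the inverse error function can be built from the standard library's `math.erf` via Newton's method. This is a sketch of the idea, not the article's actual deployment code, and a production version would want a better initial guess for arguments near ±1.

```python
import math

def erfinv(y, tol=1e-12):
    """Inverse error function via Newton's method on math.erf."""
    if not -1.0 < y < 1.0:
        raise ValueError("erfinv domain is (-1, 1)")
    x = 0.0
    for _ in range(60):
        err = math.erf(x) - y
        if abs(err) < tol:
            break
        # d/dx erf(x) = 2/sqrt(pi) * exp(-x^2)
        x -= err / (2.0 / math.sqrt(math.pi) * math.exp(-x * x))
    return x
```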

Results

The full deployment pipeline produces an algorithm library meeting client requirements.

Parameters: 300 k

Memory: ≤10 MB

Latency: 16 ms

Library size: ≤2 MB

Effect Demonstration

The algorithm combines traditional methods with deep learning to denoise while preserving speech quality. Sample denoising results include keyboard noise and wind/road noise.

Keyboard noise result
Keyboard noise result 2
Wind and road noise result
Wind and road noise result 2
Tags: deep learning, GRU, multi-head attention, speech enhancement, audio denoising, TFLite
Written by

Douyu Streaming

Official account of Douyu Streaming Development Department, sharing audio and video technology best practices.
