
Multi-Task Audio Source Separation (MTASS) and SpeechNAS: AutoML‑Driven Large‑Scale Speaker Recognition

This article presents two ASRU‑2021 accepted works from Kuaishou: MTASS, a multi‑task audio source separation framework that jointly separates speech, music and noise, and SpeechNAS, an AutoML‑based neural architecture search method that achieves state‑of‑the‑art speaker recognition performance with significantly fewer parameters.

Kuaishou Tech

Kuaishou’s short‑video platform mixes speech, music, sound effects and background noise, creating a challenging audio environment. To address this, the team proposes two techniques, Multi‑Task Audio Source Separation (MTASS) and SpeechNAS, both accepted at ASRU 2021.

MTASS introduces the first multi‑task audio separation task that unifies speech enhancement, speech separation and music separation within a single model, enabling simultaneous extraction of clean speech, music and residual noise from complex mixtures.

SpeechNAS is the first successful application of neural architecture search to large‑scale speaker recognition; on VoxCeleb1 it attains accuracy comparable to the best prior models while using only 69% of their parameters.

The MTASS‑Dataset contains 55.6 h of training data (10‑second clips) with balanced speech, music and noise, sampled at 16 kHz, and includes separate development and test sets.
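Building such a dataset amounts to mixing a speech clip with music and noise clips at controlled levels. A minimal sketch of that mixing step is below; the actual SNR ranges and mixing recipe used for MTASS‑Dataset are not given in this article, so `mix_sources` and its default SNR values are illustrative assumptions only:

```python
import numpy as np

SAMPLE_RATE = 16_000                     # MTASS audio is sampled at 16 kHz
CLIP_SECONDS = 10                        # each training clip is 10 seconds long
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS

def mix_sources(speech, music, noise, music_snr_db=0.0, noise_snr_db=5.0):
    """Mix one speech clip with music and noise at given SNRs relative to speech.

    All inputs are 1-D float arrays of CLIP_SAMPLES samples. Returns the
    mixture plus the (rescaled) separation targets.
    """
    def rms(x):
        return np.sqrt(np.mean(x ** 2) + 1e-12)

    def scale_to_snr(ref, sig, snr_db):
        # Scale `sig` so that the ref-to-sig power ratio equals snr_db.
        target_rms = rms(ref) / (10 ** (snr_db / 20))
        return sig * (target_rms / rms(sig))

    music = scale_to_snr(speech, music, music_snr_db)
    noise = scale_to_snr(speech, noise, noise_snr_db)
    mixture = speech + music + noise
    return mixture, (speech, music, noise)
```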

Complex‑MTASSNet, the proposed model, operates in the complex frequency domain with a two‑stage design: a separation module based on multi‑scale TCNs and a residual‑signal compensation module that refines each source’s leakage using a gated TCN.
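The two-stage data flow can be sketched as follows. Here `separator` and `compensators` are hypothetical stand-ins for the multi-scale-TCN separation module and the gated-TCN compensation module, and treating the compensated quantity as the mixture energy left unassigned by stage 1 is this sketch's simplifying assumption, not the paper's exact formulation:

```python
import numpy as np

def complex_mtassnet_flow(mix_spec, separator, compensators):
    """Two-stage separate-then-refine data flow (sketch).

    mix_spec:     complex STFT of the mixture, shape (freq, frames).
    separator:    callable mapping the mixture spectrum to a dict of
                  per-source complex estimates (stage-1 stand-in).
    compensators: dict of per-source callables mapping a residual spectrum
                  to a leakage estimate (stage-2 stand-in).
    """
    est = separator(mix_spec)                 # stage 1: coarse separation
    residual = mix_spec - sum(est.values())   # energy stage 1 left unassigned
    # stage 2: each compensator recovers its source's share of the residual
    return {src: spec + compensators[src](residual)
            for src, spec in est.items()}
```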

Experiments on the MTASS‑Dataset show that Complex‑MTASSNet outperforms baseline methods (GCRN, Conv‑TasNet, Demucs, D3Net) in SDRi, reaching 12.57 dB for speech, 9.86 dB for music and 8.42 dB for noise, while keeping a modest 28.18 M parameters and a low multiply‑accumulate cost, and it runs in real time on both CPU and GPU.
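SDRi (signal‑to‑distortion‑ratio improvement) measures how much an estimate improves over simply taking the raw mixture as the estimate. A minimal version, using the plain (non-scale-invariant) SDR definition:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB (plain, non-scale-invariant form)."""
    distortion = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) /
                         (np.sum(distortion ** 2) + 1e-12))

def sdr_improvement(reference, mixture, estimate):
    """SDRi: gain of the estimate over using the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)
```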

SpeechNAS defines a search space over branch number, feature dimension and channel‑selection dimension on a D‑TDNN backbone, uses Bayesian optimization to find optimal sub‑networks, and trains them with a mixed additive‑margin softmax and minimum‑hyperspherical‑energy loss; the resulting models (SpeechNAS‑1 through SpeechNAS‑5) achieve lower equal error rates than prior art with up to 31% fewer parameters.
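The search loop can be sketched as follows. SpeechNAS itself uses Bayesian optimization; this sketch substitutes plain random search as a stand-in, and the candidate values in `SEARCH_SPACE` are illustrative, not the paper's actual D‑TDNN ranges:

```python
import random

# Hypothetical search-space values; the paper's real ranges are not
# reproduced here.
SEARCH_SPACE = {
    "num_branches":   [1, 2, 3],        # branch number per dense block
    "feature_dim":    [64, 96, 128],    # bottleneck feature dimension
    "channel_select": [256, 384, 512],  # channel-selection dimension
}

def sample_architecture(rng):
    """Draw one candidate sub-network configuration."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def search(evaluate_eer, n_trials=20, seed=0):
    """Random-search stand-in for SpeechNAS's Bayesian optimization.

    evaluate_eer(arch) is a user-supplied callable that trains/evaluates
    the candidate and returns its equal error rate (lower is better).
    """
    rng = random.Random(seed)
    best_arch, best_eer = None, float("inf")
    for _ in range(n_trials):
        arch = sample_architecture(rng)
        eer = evaluate_eer(arch)
        if eer < best_eer:
            best_arch, best_eer = arch, eer
    return best_arch, best_eer
```

In the real method the evaluation result of each candidate would feed a surrogate model that proposes the next candidate, rather than sampling uniformly.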

In summary, the MTASS framework provides a versatile front‑end for audio processing that benefits downstream tasks such as speech recognition and music retrieval, while SpeechNAS demonstrates the practical impact of AutoML in large‑scale speaker verification across multiple Kuaishou services.

Tags: AutoML, Neural Architecture Search, audio separation, MTASS, speaker recognition, SpeechNAS
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
