
High‑Quality Automatic Speech Recognition (ASR) Solutions at Bilibili: Data, Model, and Deployment Optimizations

Bilibili’s high‑quality ASR system combines large‑scale filtered business data, semi‑supervised Noisy‑Student training, an end‑to‑end CTC model with lattice‑free MMI decoding, and FP16‑optimized FasterTransformer inference on Triton, delivering top‑ranked accuracy, low latency, and scalable deployment for diverse Chinese‑English video content.

Bilibili Tech

Automatic Speech Recognition (ASR) technology has been deployed at Bilibili for large‑scale business scenarios such as audio‑video content safety review, AI subtitles (C‑side, Bianjian, S12 live), and video understanding (full‑text retrieval).

Bilibili’s ASR engine also took first place on the 2022 SpeechIO benchmark (https://github.com/SpeechColab/Leaderboard).

Overall ranking across all test sets:

| Rank | Vendor | CER |
| --- | --- | --- |
| 1 | Bilibili (B站) | 2.82% |
| 2 | Alibaba Cloud (阿里云) | 2.85% |
| 3 | Yitu (依图) | 3.16% |
| 4 | Microsoft (微软) | 3.28% |
| 5 | Tencent (腾讯) | 3.85% |
| 6 | iFlytek (讯飞) | 4.05% |
| 7 | AISpeech (思必驰) | 5.19% |
| 8 | Baidu (百度) | 8.14% |

A high‑quality, cost‑effective ASR engine for industrial production should have the following characteristics:

High accuracy and robustness in target business scenarios.

High performance: low latency, fast speed, and low resource consumption.

High scalability: efficient support for business‑driven customization and rapid updates.

Data cold‑start is a major challenge because the ASR system requires a large and diverse training set (different acoustic environments, domains, and accents). Bilibili faces three specific difficulties:

Cold start: only a tiny amount of open‑source data is initially available, and purchased data has low relevance to the business.

Broad domain coverage: dozens of content categories demand high data diversity.

Mixed Chinese‑English content: many user‑generated videos contain both languages.

Solutions include business-data filtering (cleaning timestamps, aligning sentences, normalizing numerals) and semi-supervised Noisy Student Training (NST). Roughly 500k raw videos were filtered to produce about 40,000 hours of automatically labeled audio; combined with 15,000 hours of manually labeled data, this improved recognition accuracy by about 15%.
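The pseudo-label filtering step at the heart of an NST iteration can be sketched as follows. This is a minimal illustration, not Bilibili's actual pipeline: the `PseudoUtterance` fields, confidence threshold, and duration bounds are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class PseudoUtterance:
    audio_id: str
    transcript: str      # teacher-model hypothesis
    confidence: float    # e.g. average per-token posterior from the teacher
    duration: float      # seconds

def filter_pseudo_labels(utts, min_conf=0.9, min_dur=1.0, max_dur=30.0):
    """Keep only pseudo-labeled utterances the teacher is confident about
    and whose duration is in a sane range for acoustic-model training."""
    return [u for u in utts
            if u.confidence >= min_conf
            and min_dur <= u.duration <= max_dur
            and u.transcript.strip()]
```

A Noisy-Student iteration then mixes the surviving pseudo-labels with the manually labeled set and retrains the student with input noise (e.g. SpecAugment) before repeating the cycle with the improved model as the new teacher.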

ASR technology evolution can be divided into three stages:

1993‑2009: HMM‑GMM era, slow progress, high word error rates.

2009‑2015: Deep learning rise, HMM‑DNN hybrid models, significant accuracy gains.

2015‑present: End‑to‑end (E2E) models (CTC, AED, RNNT) with large, complex networks, sometimes surpassing human performance.

Comparison of hybrid and end‑to‑end frameworks:

| | Hybrid framework | End-to-end (E2E) framework |
| --- | --- | --- |
| Representative toolkits | HTK, Kaldi | ESPnet, WeNet, DeepSpeech, K2 |
| Languages | C/C++, Shell | Python, Shell |
| Underlying framework | Developed from scratch | TensorFlow / PyTorch |

Typical CER results on representative datasets:

| Model | Librispeech | GigaSpeech | Aishell-1 | WenetSpeech |
| --- | --- | --- | --- | --- |
| Hybrid (Kaldi Chain + LM) | 3.06 | 14.84 | 7.43 | 12.83 |
| E2E-AED | 11.8 | 6.6 | 4.72 | – |
| E2E-RNNT | 12.4 | – | – | – |
| Optimized E2E-CTC | 7.1 | 5.8 | – | – |

(– = not reported)

Based on the analysis, Bilibili adopts an end‑to‑end CTC system with a dynamic decoder to meet high‑throughput, low‑latency, and high‑accuracy requirements across diverse scenarios.

End‑to‑end lattice‑free MMI discriminative training further improves timestamp accuracy and overall CER. Results on Bilibili’s video test set:

| Model | CER (%) |
| --- | --- |
| CTC baseline | 6.96 |
| Traditional DT | 6.63 |
| E2E LF-MMI DT | 6.13 |
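For reference, the criterion maximized in both discriminative-training rows is the standard MMI objective (notation follows the Povey et al. LF-MMI paper cited in the references, not the article itself):

```latex
\mathcal{F}_{\mathrm{MMI}}
  \;=\; \sum_{r=1}^{R} \log
  \frac{p(O_r \mid W_r)^{\kappa}\, P(W_r)}
       {\sum_{W} p(O_r \mid W)^{\kappa}\, P(W)}
```

where \(O_r\) and \(W_r\) are the audio and reference transcript of utterance \(r\) and \(\kappa\) is the acoustic scale. The lattice-free variant evaluates the denominator over a phone-level graph rather than per-utterance word lattices, which is what makes end-to-end training practical.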

The end‑to‑end decoder, based on beam search, consumes only about 1/5 of the resources of a traditional WFST decoder and is 5× faster, while being easier to customize with external language models.
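A dynamic CTC decoder of this kind is built on prefix beam search over per-frame label posteriors. The sketch below is a minimal plain-Python version for illustration: it omits the external language-model hook and all batching, and the function and variable names are mine, not Bilibili's production decoder.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, beam_size=8, blank=0):
    """Decode a T x V matrix of per-frame log posteriors.

    Each beam entry maps a label prefix to a score pair:
    (log P(prefix, last frame was blank),
     log P(prefix, last frame was the final label)).
    """
    beams = {(): (0.0, NEG_INF)}  # start: empty prefix, all mass on blank
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for v, lp in enumerate(frame):
                if v == blank:
                    # blank keeps the prefix, moving mass to the blank state
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                elif prefix and v == prefix[-1]:
                    # repeated label: only the blank state spawns a new label;
                    # the non-blank state collapses onto the same prefix
                    b, nb = nxt[prefix + (v,)]
                    nxt[prefix + (v,)] = (b, logsumexp(nb, p_b + lp))
                    b2, nb2 = nxt[prefix]
                    nxt[prefix] = (b2, logsumexp(nb2, p_nb + lp))
                else:
                    b, nb = nxt[prefix + (v,)]
                    nxt[prefix + (v,)] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        # pruning: keep only the best `beam_size` prefixes per frame
        beams = dict(sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam_size])
    best, (p_b, p_nb) = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return best, logsumexp(p_b, p_nb)
```

An external n-gram or neural language model would be fused at the point where a new label is appended to a prefix, which is why this style of decoder is easy to customize compared with a precompiled WFST graph.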

Model inference is optimized by converting the model to FasterTransformer with FP16 precision and deploying it on Triton Inference Server with dynamic batching. On a single NVIDIA T4 GPU, throughput increases by 2× and speed improves by 30% (≈3,000 hours of audio transcribed per hour).
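Dynamic batching in Triton is enabled per model in its `config.pbtxt`. The fragment below is an illustrative sketch only: the model name, tensor names and shapes, batch sizes, and queue delay are assumptions, not Bilibili's production values.

```protobuf
name: "asr_encoder_ft"            # FasterTransformer-converted encoder (name is illustrative)
backend: "fastertransformer"
max_batch_size: 32
input [
  { name: "speech_features", data_type: TYPE_FP16, dims: [ -1, 80 ] }
]
output [
  { name: "log_probs", data_type: TYPE_FP16, dims: [ -1, -1 ] }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]        # server groups requests toward these sizes
  max_queue_delay_microseconds: 5000     # how long to wait to fill a batch
}
instance_group [ { count: 1, kind: KIND_GPU } ]
```

With this in place, Triton assembles concurrent requests into GPU-sized batches automatically, which is what lets a single T4 sustain high offline-transcription throughput without client-side batching logic.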

Summary

This article presents Bilibili’s end‑to‑end ASR solution, covering data cold‑start, semi‑supervised training, model algorithm optimizations, decoder design, and inference deployment. Future work includes hot‑word integration, entity‑level accuracy improvements, and real‑time streaming ASR for games and sports events.

References

A. Baevski, H. Zhou, et al., “wav2vec 2.0: A Framework for Self‑Supervised Learning of Speech Representations.”

A. Baevski, W. Hsu, et al., “data2vec: A General Framework for Self‑supervised Learning in Speech, Vision and Language.”

D. S. Park, Y. Zhang, et al., “Improved Noisy Student Training for Automatic Speech Recognition.”

C. Lüscher, E. Beck, et al., “RWTH ASR Systems for LibriSpeech: Hybrid vs Attention – w/o Data Augmentation.”

R. Prabhavalkar, K. Rao, et al., “A Comparison of Sequence‑to‑Sequence Models for Speech Recognition.”

D. Povey, V. Peddinti, et al., “Purely sequence‑trained neural networks for ASR based on lattice‑free MMI.”

H. Xiang, Z. Ou, “CRF‑Based Single‑Stage Acoustic Modeling with CTC Topology.”

Z. Chen, W. Deng, et al., “Phone Synchronous Decoding with CTC Lattice.”

https://github.com/NVIDIA/FasterTransformer
