End-to-End Speech Recognition Optimization and Deployment at 58.com

58.com’s AI Lab presents a comprehensive overview of its end‑to‑end speech recognition system, detailing data collection, semi‑supervised training, Efficient Conformer architecture, model compression, and deployment strategies that together achieve high accuracy across diverse acoustic conditions and large‑scale production workloads.

58 Tech

At AISummit 2022, 58.com’s AI Lab presented its achievements in speech technology. The team has built a full‑stack voice platform covering voice bots, content analysis, intelligent outbound assistants, and quality‑inspection systems. Millions of hours of audio from its business lines (recruitment, real estate, automotive, local services) are processed for downstream tasks such as quality control and user profiling.

Facing challenges like heavy accents, noisy environments, and domain‑specific vocabularies, the team first used Kaldi with a Chain Model (CNN+TDNN) and later explored end‑to‑end frameworks. ESPNet’s Transformer‑CTC hybrid showed promising accuracy but suffered from slow decoding and limited streaming support.

In 2022 the team adopted WeNet, an open‑source end‑to‑end ASR framework based on Conformer + CTC + Attention. WeNet’s lightweight codebase and ease of deployment made it the primary solution. Semi‑supervised training (Noisy Student) was employed: a teacher model generated pseudo‑labels for large unlabeled corpora, which were filtered by confidence score, model disagreement, words‑per‑second, and rare‑data heuristics before iteratively training a student model.
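As a rough sketch of the filtering stage described above, the confidence and words‑per‑second heuristics might look like the following. Function names and threshold values are illustrative assumptions, not 58.com’s actual pipeline:

```python
# Hypothetical sketch of pseudo-label filtering for Noisy Student training.
# Thresholds are illustrative; the article does not publish the real values.

def filter_pseudo_labels(hypotheses, min_confidence=0.9,
                         min_wps=0.5, max_wps=6.0):
    """Keep pseudo-labeled utterances that pass simple quality heuristics.

    hypotheses: list of dicts with keys
        'text'       -- decoded transcript (str)
        'confidence' -- decoder confidence score in [0, 1]
        'duration'   -- audio length in seconds
    """
    kept = []
    for hyp in hypotheses:
        words = hyp["text"].split()
        if not words or hyp["duration"] <= 0:
            continue
        wps = len(words) / hyp["duration"]  # words per second
        # Discard low-confidence decodes and implausible speaking rates.
        if hyp["confidence"] >= min_confidence and min_wps <= wps <= max_wps:
            kept.append(hyp)
    return kept
```

In practice the surviving utterances would be mixed back into the training set for the next student iteration; the article additionally filters by teacher–student disagreement and rare‑data heuristics, which are omitted here.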

Experiments on a heavily accented real‑estate scenario (269 h labeled + 2482 h unlabeled) demonstrated that pseudo‑labeling significantly improved CER when labeled data were scarce. Additional tests on multi‑city accent sets and audio‑video moderation tasks confirmed the benefits of semi‑supervised learning for both WeNet and Kaldi models.

To accelerate inference, the team introduced Efficient Conformer, which inserts down‑sampling blocks between Conformer layers, progressively shortening the feature sequence to 1/8 of the input frame rate while preserving accuracy. Grouped MHSA and strided convolutions further cut computational cost. Benchmarks showed 10‑13% CER improvement and 10‑70% speedup (including int8 quantization) for both streaming and non‑streaming scenarios.
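A minimal sketch of one such down‑sampling step is shown below. A stride‑2 window average stands in for the model’s learned strided depthwise convolution (the real Efficient Conformer uses trained convolution kernels and grouped attention); the point is only how each block halves the time dimension on top of the usual front‑end subsampling:

```python
import numpy as np

def downsample_block(x, stride=2, kernel=3):
    """Shorten the time axis of a feature sequence with a strided
    windowed average (a stand-in for a strided depthwise conv).

    x: (T, D) array of frame-level features -> (ceil(T/stride), D)
    """
    T, _ = x.shape
    pad = (kernel - 1) // 2
    # Edge-pad so every stride position sees a full window.
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out_len = (T + stride - 1) // stride
    return np.stack([xp[i * stride: i * stride + kernel].mean(axis=0)
                     for i in range(out_len)])
```

With a conventional 1/4 front‑end subsampling followed by one stride‑2 block like this, the overall frame rate reaches the 1/8 figure quoted above, which shrinks the quadratic self‑attention cost accordingly.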

Model compression was achieved via knowledge distillation: a large teacher model provided soft targets for a smaller student, followed by fine‑tuning on labeled data. The final compressed model retained acceptable accuracy with only 27.4% of the original parameters.
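The soft‑target distillation objective can be sketched as follows. This is a generic NumPy formulation with illustrative temperature and weighting defaults; the article does not specify the exact loss configuration used in production:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Knowledge-distillation loss combining soft and hard targets:

        alpha * T^2 * KL(teacher_soft || student_soft)
        + (1 - alpha) * cross-entropy(student, hard_labels)

    logits: (N, C) arrays; hard_labels: (N,) int class ids.
    """
    T = temperature
    p_t = softmax(teacher_logits / T)
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    kl = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1))
    log_p = np.log(softmax(student_logits) + 1e-12)
    ce = -np.mean(log_p[np.arange(len(hard_labels)), hard_labels])
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

The temperature softens the teacher’s distribution so the student also learns the relative probabilities of incorrect classes; the subsequent fine‑tuning on labeled data mentioned above corresponds to continuing training with only the hard‑label term.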

Deployment leveraged the WPAI AI platform, providing CPU/GPU Docker images, RPC interfaces, and fine‑grained hot‑word support. The architecture allows seamless switching between Kaldi and WeNet back‑ends, with careful handling of cache sizes for streaming Efficient Conformer.

In summary, 58.com’s AI Lab combined data‑driven semi‑supervised training, Efficient Conformer design, and model‑distillation techniques to build a robust, scalable speech recognition service that meets the diverse acoustic demands of its large‑scale online marketplace.

Tags: AI, model compression, deployment, semi-supervised learning, speech recognition, end-to-end, Efficient Conformer
Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
