
Speech Recognition in 58.com: Application Scenarios, Data Collection, Kaldi Chain Model Practice, and End‑to‑End Exploration

This article presents a comprehensive overview of how 58.com leverages large‑scale voice data from call‑center, private phone, and micro‑chat platforms, detailing data collection, annotation, Kaldi‑based chain model training, lattice‑free techniques, and end‑to‑end Transformer‑CTC models to improve Chinese speech recognition performance.

58 Tech

Voice is a crucial communication medium for 58.com users; both C‑side customers and B‑side merchants generate massive call recordings through the platform's call‑center, phone, and micro‑chat services. These recordings are converted to text via speech recognition for downstream analysis such as intent extraction and quality inspection.

The main sources of voice data are:

Call‑center (sales and customer service calls)

Phone platform (private conversations between users and merchants)

Micro‑chat platform (voice messages)

To process this data, 58.com built two analysis systems: the Lingxi voice analysis platform (intelligent QA, emotion detection, gender identification, user profiling) and an intelligent outbound calling platform that streams voice through VAD, ASR, NLU, and dialogue management.

The speech‑recognition engine focuses on offline audio file decoding. After legality checks, audio is resampled, converted to the required channel layout, and segmented by VAD and speaker separation. Each segment is decoded, punctuation is restored in post‑processing, and the resulting text is passed to downstream NLP tasks.
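The stages above can be sketched in miniature. Every stage here is a toy stand‑in (the real engine uses proper resampling, channel conversion, speaker separation, and a Kaldi decoder); the function names and thresholds are illustrative assumptions, not 58.com's API.

```python
# Toy sketch of the offline decoding pipeline: VAD -> per-segment decoding.

def run_vad(samples, threshold=0.1):
    """Energy-style VAD stand-in: keep contiguous runs of samples whose
    magnitude exceeds the threshold, drop the silence between them."""
    segments, current = [], []
    for s in samples:
        if abs(s) > threshold:
            current.append(s)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def transcribe(samples):
    """Decode each speech segment; punctuation restoration would follow."""
    pieces = []
    for segment in run_vad(samples):
        # Stand-in for acoustic + language model decoding of one clip.
        pieces.append(f"<{len(segment)} samples decoded>")
    return pieces

print(transcribe([0.0, 0.5, 0.6, 0.0, 0.0, 0.9, 0.0]))
```

A real VAD would operate on frame energies or a neural speech/non-speech classifier rather than raw sample magnitudes, but the segment-then-decode control flow is the same.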

Challenges include diverse business domains (recruitment, classifieds, automotive, real‑estate) with varying vocabularies, noisy environments (factories, subways, malls), and nationwide accents. To address these, a dedicated annotation system collects and labels domain‑specific data, performs new‑word mining, and trains language models.
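The article does not specify the new‑word mining algorithm, but one common approach for Chinese is scoring character pairs by pointwise mutual information (PMI): pairs that co‑occur far more often than chance are candidate domain words. The corpus, thresholds, and PMI criterion below are assumptions for illustration.

```python
# Toy PMI-based new-word mining over a tiny character corpus.
import math
from collections import Counter

def mine_bigrams(corpus, pmi_threshold=1.0, min_count=2):
    chars, bigrams, total = Counter(), Counter(), 0
    for sent in corpus:
        chars.update(sent)
        total += len(sent)
        bigrams.update(sent[i:i + 2] for i in range(len(sent) - 1))
    candidates = []
    for bg, n in bigrams.items():
        if n < min_count:
            continue  # rare pairs inflate PMI; require a minimum count
        p_xy = n / total
        p_x = chars[bg[0]] / total
        p_y = chars[bg[1]] / total
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi > pmi_threshold:
            candidates.append((bg, pmi))
    return sorted(candidates, key=lambda t: -t[1])

# "保安" (security guard) and "招聘" (recruiting) surface as candidates.
print(mine_bigrams(["保安招聘", "招聘保安", "保安上班"]))
```

Mined candidates would then be added to the lexicon and language‑model training text so that domain vocabulary decodes correctly.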

Kaldi’s chain model is trained with a lattice‑free objective: instead of generating a per‑utterance denominator lattice, it uses a phoneme‑level language model to build a single compact denominator graph (HCP, composed from the HMM topology H, context dependency C, and phone LM P). The acoustic model is a TDNN evaluated at a reduced frame rate (e.g., 3‑frame subsampling) and is trained with a multi‑task objective combining the lattice‑free loss with a cross‑entropy regularizer.
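Two of these details can be sketched numerically. The actual implementation lives in Kaldi's nnet3/chain C++ code; the weight `alpha` and all values below are illustrative assumptions.

```python
# Frame subsampling and the multi-task chain objective, in miniature.
import math

def subsample(frames, factor=3):
    """Frame-skip: evaluate the network output only every `factor`-th frame,
    cutting decoding-time computation roughly by `factor`."""
    return frames[::factor]

def chain_objective(lfmmi_objf, xent_probs, target, alpha=0.1):
    """Multi-task objective to be maximized: the lattice-free (LF-MMI)
    log-probability term plus a weighted cross-entropy term (the
    log-probability of the reference target under a softmax output)."""
    return lfmmi_objf + alpha * math.log(xent_probs[target])

print(subsample(list(range(10))))  # frames 0, 3, 6, 9 survive
```

The cross‑entropy branch acts as a regularizer on the shared network output, stabilizing sequence training.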

End‑to‑end approaches are also explored. A Transformer‑CTC model augments a Seq2Seq encoder‑decoder with two convolutional down‑sampling layers in front of the encoder; the decoder's attention learns soft alignments, while an auxiliary CTC loss on the encoder encourages monotonic alignment. During decoding, shallow fusion with an external RNN language model further improves accuracy.
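Shallow fusion simply adds a weighted external LM score to the ASR model's score when ranking hypothesis extensions during beam search. The sketch below scores a single decoding step; `lm_weight` and the toy probability tables (a homophone pair 你/泥) are assumptions for illustration.

```python
# One step of shallow-fusion scoring: ASR log-prob + weight * LM log-prob.
import math

def shallow_fusion_score(asr_logprob, lm_logprob, lm_weight=0.3):
    return asr_logprob + lm_weight * lm_logprob

def best_token(asr_probs, lm_probs, lm_weight=0.3):
    """Pick the highest-scoring next token under shallow fusion."""
    scores = {
        tok: shallow_fusion_score(math.log(asr_probs[tok]),
                                  math.log(lm_probs[tok]),
                                  lm_weight)
        for tok in asr_probs
    }
    return max(scores, key=scores.get)

asr = {"你": 0.45, "泥": 0.50, "<eos>": 0.05}  # acoustics slightly prefer 泥
lm = {"你": 0.80, "泥": 0.01, "<eos>": 0.19}   # the LM strongly prefers 你
# With lm_weight=0 the acoustic model alone picks the homophone 泥;
# fusing in the LM score recovers the intended character 你.
print(best_token(asr, lm, lm_weight=0.0), best_token(asr, lm))
```

In a full beam search the same fused score is accumulated along each partial hypothesis rather than computed for a single step.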

Experimental results show that the chain model and the Transformer‑CTC end‑to‑end system achieve significant word‑error‑rate reductions compared with baseline GMM‑HMM systems, demonstrating the effectiveness of lattice‑free training, multi‑task loss, and language‑model integration for large‑scale Chinese ASR.

Author: Zhou Wei, senior AI Lab engineer at 58.com, focusing on speech‑recognition research and development.

Deep Learning, Speech Recognition, end-to-end, ASR, Chinese, Kaldi, chain model
Written by 58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.