
Speech Recognition in 58.com: Application Scenarios, Data Collection, Kaldi Chain Model Practice, and End‑to‑End Exploration

This article presents a comprehensive overview of how 58.com leverages large‑scale voice data from call‑center, private phone, and micro‑chat platforms, detailing data collection, annotation, Kaldi‑based chain model training, lattice‑free techniques, and end‑to‑end Transformer‑CTC models to improve Chinese speech recognition performance.

58 Tech

Voice is a crucial communication medium for 58.com users; both C‑side customers and B‑side merchants generate massive call recordings through the platform's call‑center, phone, and micro‑chat services. These recordings are converted to text via speech recognition for downstream analysis such as intent extraction and quality inspection.

The main sources of voice data are:

Call‑center (sales and customer service calls)

Phone platform (private conversations between users and merchants)

Micro‑chat platform (voice messages)

To process this data, 58.com built two analysis systems: the Lingxi voice analysis platform (intelligent QA, emotion detection, gender identification, user profiling) and an intelligent outbound calling platform that streams voice through VAD, ASR, NLU, and dialogue management.

The speech‑recognition engine focuses on offline audio file decoding. After legality checks, audio is resampled, converted to the required channel layout, and segmented by VAD and speaker separation. Each segment is decoded, punctuation is restored in post‑processing, and the resulting text is passed to downstream NLP tasks.
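The stages above can be sketched in miniature. Every stage here is a toy stand‑in (the real engine uses proper resampling, channel conversion, speaker separation, and a Kaldi decoder); the function names and thresholds are illustrative assumptions, not 58.com's API.

```python
# Toy sketch of the offline decoding pipeline: VAD -> per-segment decoding.

def run_vad(samples, threshold=0.1):
    """Energy-style VAD stand-in: keep contiguous runs of samples whose
    magnitude exceeds the threshold, drop the silence between them."""
    segments, current = [], []
    for s in samples:
        if abs(s) > threshold:
            current.append(s)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def transcribe(samples):
    """Decode each speech segment; punctuation restoration would follow."""
    pieces = []
    for segment in run_vad(samples):
        # Stand-in for acoustic + language model decoding of one clip.
        pieces.append(f"<{len(segment)} samples decoded>")
    return pieces

print(transcribe([0.0, 0.5, 0.6, 0.0, 0.0, 0.9, 0.0]))
```

A real VAD would operate on frame energies or a neural speech/non-speech classifier rather than raw sample magnitudes, but the segment-then-decode control flow is the same.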

Challenges include diverse business domains (recruitment, classifieds, automotive, real‑estate) with varying vocabularies, noisy environments (factories, subways, malls), and nationwide accents. To address these, a dedicated annotation system collects and labels domain‑specific data, performs new‑word mining, and trains language models.
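The article does not specify the new‑word mining algorithm, but one common approach for Chinese is scoring character pairs by pointwise mutual information (PMI): pairs that co‑occur far more often than chance are candidate domain words. The corpus, thresholds, and PMI criterion below are assumptions for illustration.

```python
# Toy PMI-based new-word mining over a tiny character corpus.
import math
from collections import Counter

def mine_bigrams(corpus, pmi_threshold=1.0, min_count=2):
    chars, bigrams, total = Counter(), Counter(), 0
    for sent in corpus:
        chars.update(sent)
        total += len(sent)
        bigrams.update(sent[i:i + 2] for i in range(len(sent) - 1))
    candidates = []
    for bg, n in bigrams.items():
        if n < min_count:
            continue  # rare pairs inflate PMI; require a minimum count
        p_xy = n / total
        p_x = chars[bg[0]] / total
        p_y = chars[bg[1]] / total
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi > pmi_threshold:
            candidates.append((bg, pmi))
    return sorted(candidates, key=lambda t: -t[1])

# "保安" (security guard) and "招聘" (recruiting) surface as candidates.
print(mine_bigrams(["保安招聘", "招聘保安", "保安上班"]))
```

Mined candidates would then be added to the lexicon and language‑model training text so that domain vocabulary decodes correctly.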

Kaldi’s chain model is trained with a lattice‑free objective: instead of generating a per‑utterance denominator lattice, it uses a phoneme‑level language model to build a single compact denominator graph (HCP, composed from the HMM topology H, context dependency C, and phone LM P). The acoustic model is a TDNN evaluated at a reduced frame rate (e.g., 3‑frame subsampling) and is trained with a multi‑task objective combining the lattice‑free loss with a cross‑entropy regularizer.
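Two of these details can be sketched numerically. The actual implementation lives in Kaldi's nnet3/chain C++ code; the weight `alpha` and all values below are illustrative assumptions.

```python
# Frame subsampling and the multi-task chain objective, in miniature.
import math

def subsample(frames, factor=3):
    """Frame-skip: evaluate the network output only every `factor`-th frame,
    cutting decoding-time computation roughly by `factor`."""
    return frames[::factor]

def chain_objective(lfmmi_objf, xent_probs, target, alpha=0.1):
    """Multi-task objective to be maximized: the lattice-free (LF-MMI)
    log-probability term plus a weighted cross-entropy term (the
    log-probability of the reference target under a softmax output)."""
    return lfmmi_objf + alpha * math.log(xent_probs[target])

print(subsample(list(range(10))))  # frames 0, 3, 6, 9 survive
```

The cross‑entropy branch acts as a regularizer on the shared network output, stabilizing sequence training.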

End‑to‑end approaches are also explored. A Transformer‑CTC model augments a Seq2Seq encoder‑decoder with two convolutional down‑sampling layers in front of the encoder; the decoder's attention learns soft alignments, while an auxiliary CTC loss on the encoder encourages monotonic alignment. During decoding, shallow fusion with an external RNN language model further improves accuracy.
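Shallow fusion simply adds a weighted external LM score to the ASR model's score when ranking hypothesis extensions during beam search. The sketch below scores a single decoding step; `lm_weight` and the toy probability tables (a homophone pair 你/泥) are assumptions for illustration.

```python
# One step of shallow-fusion scoring: ASR log-prob + weight * LM log-prob.
import math

def shallow_fusion_score(asr_logprob, lm_logprob, lm_weight=0.3):
    return asr_logprob + lm_weight * lm_logprob

def best_token(asr_probs, lm_probs, lm_weight=0.3):
    """Pick the highest-scoring next token under shallow fusion."""
    scores = {
        tok: shallow_fusion_score(math.log(asr_probs[tok]),
                                  math.log(lm_probs[tok]),
                                  lm_weight)
        for tok in asr_probs
    }
    return max(scores, key=scores.get)

asr = {"你": 0.45, "泥": 0.50, "<eos>": 0.05}  # acoustics slightly prefer 泥
lm = {"你": 0.80, "泥": 0.01, "<eos>": 0.19}   # the LM strongly prefers 你
# With lm_weight=0 the acoustic model alone picks the homophone 泥;
# fusing in the LM score recovers the intended character 你.
print(best_token(asr, lm, lm_weight=0.0), best_token(asr, lm))
```

In a full beam search the same fused score is accumulated along each partial hypothesis rather than computed for a single step.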

Experimental results show that the chain model and the Transformer‑CTC end‑to‑end system achieve significant word‑error‑rate reductions compared with baseline GMM‑HMM systems, demonstrating the effectiveness of lattice‑free training, multi‑task loss, and language‑model integration for large‑scale Chinese ASR.

Author: Zhou Wei, senior AI Lab engineer at 58.com, focusing on speech‑recognition research and development.

Deep Learning, Speech Recognition, end-to-end, ASR, Chinese, Kaldi, chain model
Written by 58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.