Speaker Verification System for Detecting Spam Calls on the 58 Used‑Car Platform
This article describes how the 58 used‑car team built a speaker‑verification pipeline—covering data collection, MFCC feature extraction, LSTM and GMM modeling, threshold tuning, multi‑speaker clustering, and deployment results—to automatically block nuisance telemarketing calls while preserving user privacy.
Background
The rapid growth of online used‑car listings on 58 created a large pool of displayed phone numbers, which were abused by telemarketing robots that called users with pre‑recorded messages to obtain personal phone numbers. To protect user privacy, the team developed a voice‑biometrics system that identifies and blocks such spam callers.
System Overview
The solution consists of three parts: a model‑file repository that stores models from different training stages, an API that exposes the verification service, and a manual‑labeling workflow for correcting misclassifications. Call recordings provided by carriers are split by channel, and the right‑channel (caller) audio is preprocessed before model inference.
Audio Processing Pipeline
Raw MP3 recordings are converted to WAV, the right channel is extracted, and long recordings are segmented wherever silence lasts at least 700 ms. Voice‑activity detection (VAD) removes silent portions, and the resulting clips are limited to 15 s. MFCC features are then extracted for each segment: 13 coefficients plus their first‑ and second‑order deltas, giving 39 dimensions in total.
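The segmentation step can be sketched as follows. This is a minimal illustration, not the team's implementation: it assumes a VAD has already produced one voiced/unvoiced flag per frame, that each frame covers 10 ms (the article does not state a frame size), and that the 15 s limit is applied by truncation. The function name `segment_voiced_frames` is hypothetical.

```python
from typing import List

FRAME_MS = 10           # assumed frame hop; not stated in the article
MIN_SILENCE_MS = 700    # silence threshold from the article
MAX_CLIP_MS = 15_000    # clip length cap from the article


def segment_voiced_frames(voiced: List[bool]) -> List[List[int]]:
    """Split a recording into clips of voiced frame indices.

    A run of at least MIN_SILENCE_MS of unvoiced frames closes the current
    clip. Shorter pauses are dropped but do not split the clip. Each clip is
    truncated to MAX_CLIP_MS of voiced audio (an assumed policy).
    """
    min_silence_frames = MIN_SILENCE_MS // FRAME_MS
    max_clip_frames = MAX_CLIP_MS // FRAME_MS

    clips: List[List[int]] = []
    current: List[int] = []
    silence_run = 0

    for idx, is_voiced in enumerate(voiced):
        if is_voiced:
            silence_run = 0
            # Keep the frame only while the clip is under the length cap.
            if len(current) < max_clip_frames:
                current.append(idx)
        else:
            silence_run += 1
            # Close the clip once the pause reaches the silence threshold.
            if silence_run == min_silence_frames and current:
                clips.append(current)
                current = []

    if current:
        clips.append(current)
    return clips
```

Each returned clip is a list of frame indices, which a feature extractor could then use to select the frames for MFCC computation.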
Model Training
A two‑layer bidirectional LSTM (sigmoid activation, batch size 16, 1000 epochs) achieved 0.96 accuracy on the training set and above 0.80 on the test set. To improve recall, a Gaussian Mixture Model (GMM) with a decision threshold around 5 was also trained. Combining the LSTM and the GMM gave the best trade‑off, with an LSTM threshold of 0.001 and a GMM threshold of 5.4.
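The article gives the two thresholds but does not say how the model outputs are combined. The sketch below assumes an OR rule (either model can flag a call), which matches the stated goal of improving recall, and it assumes that higher scores from both models mean a closer match to a known spam speaker. The `Scores` class and `is_spam` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

LSTM_THRESHOLD = 0.001  # from the article
GMM_THRESHOLD = 5.4     # from the article


@dataclass
class Scores:
    lstm_prob: float  # assumed: LSTM output, higher means more likely spam
    gmm_score: float  # assumed: GMM score, higher means a closer match


def is_spam(scores: Scores) -> bool:
    """Flag the call if either model exceeds its threshold (assumed OR rule)."""
    return (
        scores.lstm_prob >= LSTM_THRESHOLD
        or scores.gmm_score >= GMM_THRESHOLD
    )
```

An OR rule trades precision for recall; an AND rule would do the opposite, so the right choice depends on how costly a wrongly blocked call is.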
Evaluation
After deployment, daily blocked calls dropped from ~400 to <200, and complaint rates fell from ~100 per week to ~10. The system’s precision ranged from 0.65 to 0.85, with occasional over‑fitting due to limited negative samples.
Multi‑Speaker Extension
To handle recordings that contain multiple speakers, the team explored embedding‑based clustering: MFCC features are fed to a batch‑wise LSTM, the resulting embeddings are L2‑normalized, and segments are clustered by cosine similarity. Preliminary results showed over‑fitting caused by sparse data and noisy segments; future work will focus on larger corpora and noise removal.
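The normalization and clustering step might look like the sketch below. The article does not name a clustering algorithm, so this uses a simple greedy scheme as a stand-in: each segment embedding joins the most similar existing cluster if the cosine similarity reaches a threshold, and otherwise starts a new cluster. The function names and the 0.8 threshold are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, which equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def cluster_segments(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[int]:
    """Greedily assign each embedding a speaker-cluster label."""
    sums: List[List[float]] = []       # running sum of members per cluster
    centroids: List[List[float]] = []  # unit-length centroid per cluster
    labels: List[int] = []

    for emb in embeddings:
        unit = l2_normalize(emb)
        best, best_sim = -1, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(unit, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            # No cluster is similar enough: start a new one.
            sums.append(list(unit))
            centroids.append(unit)
            labels.append(len(centroids) - 1)
        else:
            # Update the running sum and re-normalize the centroid.
            sums[best] = [s + u for s, u in zip(sums[best], unit)]
            centroids[best] = l2_normalize(sums[best])
            labels.append(best)
    return labels
```

Because the embeddings are normalized first, vectors that differ only in magnitude land in the same cluster, which is the usual reason for pairing L2 normalization with cosine similarity.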
Conclusion
The speaker‑verification pipeline, which spans signal processing, deep learning, and statistical modeling, proved effective at reducing spam calls in the 58 used‑car ecosystem. Ongoing research aims to extend the approach to multi‑speaker scenarios and further reduce manual labeling effort.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.