Improving Text Matching Accuracy in Voice Assistants: Experiments with Siamese Networks, BERT Models, and Advanced Tricks
This article evaluates classic Siamese networks, several BERT-based pretrained models, and training tricks such as adversarial training, k-fold cross-validation, and model ensembling, on both a public sentence-similarity competition dataset and an internal voice-assistant standard-question matching dataset, ultimately raising voice-assistant accuracy from 97.23 % to 99.50 %.
Background – Text matching is a core natural-language-processing task and underpins dialogue management in voice assistants. The study uses a 40,000-sample voice-assistant dataset (balanced positive/negative pairs, split 8:1:1 into train/dev/test) to benchmark classic Siamese architectures against modern pretrained models.
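The 8:1:1 split mentioned above can be sketched as a small helper. This is an illustrative function, not the authors' pipeline; the seed and function name are assumptions.

```python
import random

def split_8_1_1(samples, seed=42):
    """Shuffle and split samples 8:1:1 into train/dev/test,
    mirroring the article's dataset protocol (illustrative only)."""
    rng = random.Random(seed)
    data = list(samples)
    rng.shuffle(data)
    n = len(data)
    n_train = int(n * 0.8)
    n_dev = int(n * 0.1)
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])

# 40,000 samples -> 32,000 train / 4,000 dev / 4,000 test
train, dev, test = split_8_1_1(range(40000))
```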
Experimental Design – The authors reproduced top solutions from the Alibaba DAMO‑Lab COVID‑sentence‑pair competition, testing four stages: (1) classic Siamese networks (SiameseLSTM, ABCNN, BiMPM, ESIM); (2) Chinese BERT‑style pretrained models (bert_wwm_ext, roberta_wwm_large, ERNIE); (3) text‑matching tricks (adversarial training, k‑fold cross‑validation, model ensemble); (4) comparison on both competition and voice‑assistant datasets.
Classic Siamese Networks
| Strategy | Dataset | Accuracy |
| --- | --- | --- |
| SiameseLSTM | Competition | 0.8323 |
| SiameseLSTM | Voice Assistant | 0.9723 |
| ABCNN | Competition | 0.8528 |
| ABCNN | Voice Assistant | 0.9745 |
| BiMPM | Competition | 0.8916 |
| BiMPM | Voice Assistant | 0.9848 |
| ESIM | Competition | 0.9077 |
| ESIM | Voice Assistant | 0.9880 |
The progression from SiameseLSTM to ESIM shows steady gains, with attention mechanisms and richer matching strategies improving performance.
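The shared idea behind all of these architectures is to encode both sentences with the same weights and compare the resulting vectors. A minimal sketch, using mean-pooled toy word vectors and cosine similarity in place of a trained LSTM encoder (the embeddings and queries below are invented for illustration):

```python
import math

def encode(tokens, embeddings):
    """Mean-pool token vectors into one sentence vector.
    Stands in for the shared encoder of a Siamese network."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 2-d embeddings (illustrative values only).
emb = {"open": [1.0, 0.0], "the": [0.1, 0.1], "light": [0.0, 1.0],
       "turn": [0.9, 0.1], "on": [0.2, 0.2], "lamp": [0.1, 0.9]}

q1 = encode(["open", "the", "light"], emb)
q2 = encode(["turn", "on", "lamp"], emb)
score = cosine(q1, q2)  # higher score -> queries more likely to match
```

A real SiameseLSTM replaces mean pooling with a BiLSTM, and ESIM adds cross-sentence attention between the two encoded sequences before comparison.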
BERT Series Pretrained Models
| Strategy | Dataset | Accuracy |
| --- | --- | --- |
| bert_wwm_ext | Competition | 0.9496 |
| ernie | Competition | 0.9499 |
| roberta_wwm_large | Competition | 0.9502 |
| bert_wwm_ext | Voice Assistant | 0.9938 |
| ernie | Voice Assistant | 0.9940 |
| roberta_wwm_large | Voice Assistant | 0.9930 |
All BERT-style models significantly outperform the classic Siamese networks, with a best competition accuracy of 95.02 % (roberta_wwm_large) and a best voice-assistant accuracy of 99.40 % (ernie).
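Unlike the Siamese approach, BERT-style models score a pair by feeding both sentences into one encoder as a single sequence. A sketch of that input layout, with naive whitespace tokenization standing in for the model's real WordPiece vocabulary:

```python
def build_pair_input(tokens_a, tokens_b, max_len=32):
    """Assemble a BERT-style sentence-pair input:
    [CLS] a [SEP] b [SEP], with segment ids 0 for a and 1 for b.
    Illustrative helper; a real tokenizer maps tokens to vocab ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    tokens = tokens[:max_len]
    segment_ids = segment_ids[:max_len]
    attention_mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, segment_ids, attention_mask

toks, segs, mask = build_pair_input("open the light".split(),
                                    "turn on the lamp".split(),
                                    max_len=12)
```

The classifier head then reads the final [CLS] representation to predict match/no-match, which lets self-attention compare the two sentences token by token from the very first layer.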
Tricks for Further Improvement
Adversarial training (FGM, PGD) – adds small gradient-based perturbations to the embedding layer during training, improving robustness.
K‑fold cross‑validation – trains multiple models on different folds and aggregates predictions.
Model ensemble (bagging) – combines predictions from multiple BERT variants.
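The core of the FGM trick listed above is a single perturbation step: move the embedding along the normalized gradient of the loss, r = epsilon * g / ||g||. A minimal sketch with plain lists standing in for tensors (the vectors and epsilon below are invented for illustration):

```python
import math

def fgm_perturb(embedding, grad, epsilon=1.0):
    """FGM step: add epsilon * g / ||g|| to the embedding, so the
    adversarial example stays within an L2 ball of radius epsilon."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        return list(embedding)
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]

emb = [0.5, -0.2, 0.1]
grad = [0.3, 0.0, -0.4]
adv = fgm_perturb(emb, grad, epsilon=0.5)
```

In a real training loop the model is run a second time on the perturbed embeddings and both losses are backpropagated; PGD repeats this step several times with projection back into the epsilon-ball.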
| Strategy | Dataset | Accuracy |
| --- | --- | --- |
| ernie + external data + transfer | Competition | 0.9559 |
| ernie + external data + transfer + pgd | Competition | 0.9576 |
| ernie + external data + transfer + pgd | Voice Assistant | 0.9948 |
| ernie + 5-fold CV | Competition | 0.9515 |
| Model ensemble (bert_wwm_ext, ernie, roberta_wwm_large) | Competition | 0.9585 |
| Model ensemble (same three models) | Voice Assistant | 0.9950 |
These tricks consistently raise performance; the final ensemble reaches 95.85 % accuracy on the competition set and 99.50 % on the voice‑assistant set.
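Both the k-fold and bagging entries above reduce to the same inference-time operation: average per-example match probabilities across models (or fold members) and threshold the result. A sketch with hypothetical probabilities (the numbers below are invented, not the article's outputs):

```python
def ensemble_predict(prob_lists, threshold=0.5):
    """Bagging-style ensemble: average per-example probabilities
    from several models, then threshold into 0/1 labels."""
    n = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / len(prob_lists)
           for i in range(n)]
    return [1 if p >= threshold else 0 for p in avg], avg

# Hypothetical match probabilities from three fine-tuned models.
bert_p    = [0.92, 0.40, 0.55]
ernie_p   = [0.88, 0.35, 0.48]
roberta_p = [0.95, 0.52, 0.45]
labels, avg = ensemble_predict([bert_p, ernie_p, roberta_p])
```

Averaging smooths out disagreements between individual models, which is why the third example flips to negative even though one model scores it above the threshold.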
Conclusion – BERT‑based models are the dominant solution for Chinese text matching in voice assistants. Incorporating adversarial training, k‑fold validation, and ensemble methods yields further gains, demonstrating a practical roadmap for deploying high‑accuracy NLU components.
Author – Yin Zilong, AI Lab algorithm engineer at 58.com, specializing in voice data analysis and intelligent writing.
References
ABCNN: https://arxiv.org/abs/1512.05193
BiMPM: https://arxiv.org/abs/1702.03814
ESIM: https://arxiv.org/abs/1609.06038
ERNIE 1.0: https://arxiv.org/abs/1904.09223
Chinese BERT-WWM: https://github.com/ymcui/Chinese-BERT-wwm
Adversarial training: https://zhuanlan.zhihu.com/p/91269728
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.