
Design and Implementation of Intent Recognition, Semantic Similarity Matching, and Slot Filling for a Voice Robot

This article details the architecture and algorithms behind a voice robot's natural language understanding module, covering single‑sentence intent classification with TextCNN, acoustic quality detection using VGGish‑BiLSTM, semantic similarity matching via DSSM and TextCNN‑Transformer, and slot‑filling with IDCNN‑CRF, along with performance results and future directions.

58 Tech

The voice robot, developed by 58 Tongcheng AI Lab, requires real‑time understanding of user speech to drive multi‑turn dialogues; semantic intent serves as the input to the dialog management system, enabling the robot to label user utterances for downstream processing.

For single‑sentence intent detection, 19 intent labels are defined, spanning mainline, generic, and reject intents. A TextCNN text‑classification model was selected after experiments with BiLSTM, FastText, and BERT; it achieves 75% overall accuracy, rising to 97% when ASR‑related errors are excluded.
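The TextCNN forward pass can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the weights are random and untrained, and the vocabulary size, embedding dimension, kernel widths, and filter count are assumed values chosen only to keep the example small. Only the 19 output labels come from the article.

```python
import math
import random

def init_params(vocab_size, dim, widths, n_filters, n_labels, seed=0):
    """Random, untrained weights; a real model would learn these from labeled data."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.5, 0.5)

    return {
        "embedding": [[rand() for _ in range(dim)] for _ in range(vocab_size)],
        # For each kernel width: a list of (weights[width][dim], bias) filters.
        "filters": {
            w: [([[rand() for _ in range(dim)] for _ in range(w)], rand())
                for _ in range(n_filters)]
            for w in widths
        },
        "out_w": [[rand() for _ in range(len(widths) * n_filters)]
                  for _ in range(n_labels)],
        "out_b": [0.0] * n_labels,
    }

def text_cnn_forward(token_ids, params, pad_id=0):
    """Return one probability per intent label for a tokenized sentence."""
    # Pad short sentences so the widest kernel fits at least once.
    max_width = max(params["filters"])
    padded = list(token_ids) + [pad_id] * max(0, max_width - len(token_ids))
    emb = [params["embedding"][t] for t in padded]
    pooled = []
    for width, filters in params["filters"].items():
        for weights, bias in filters:
            activations = []
            for start in range(len(emb) - width + 1):
                window = emb[start:start + width]
                z = bias + sum(w * x
                               for w_row, x_row in zip(weights, window)
                               for w, x in zip(w_row, x_row))
                activations.append(max(z, 0.0))       # ReLU
            pooled.append(max(activations))           # max-over-time pooling
    logits = [b + sum(w * p for w, p in zip(row, pooled))
              for row, b in zip(params["out_w"], params["out_b"])]
    peak = max(logits)                                # subtract max for a stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes; only n_labels=19 is taken from the article.
params = init_params(vocab_size=50, dim=8, widths=(2, 3, 4), n_filters=4, n_labels=19)
probs = text_cnn_forward([5, 17, 3], params)
predicted_label = max(range(len(probs)), key=probs.__getitem__)
```

Each kernel width acts as an n‑gram detector, and max‑over‑time pooling keeps only the strongest response per filter, which is why the model copes with utterances of varying length.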

To mitigate the impact of poor audio quality, a VGGish + BiLSTM acoustic model classifies each recording by sound type (clear speech, deafness, noise, etc.). This improves intent accuracy for unclear recordings and reaches 86% accuracy on the most challenging category, “deafness”.

Semantic similarity matching handles user queries that are not directly mapped to intent labels. A DSSM‑based approach converts sentences into semantic vectors; later stages adopt a TextCNN + Transformer architecture, yielding 83.39% accuracy and 80.1% recall for matching standard questions and keyword‑based QA.
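The vector‑matching idea behind a DSSM‑style approach can be sketched as below. This is a simplified illustration under stated assumptions: it keeps only the letter‑trigram word hashing and cosine scoring commonly associated with DSSM and omits the learned neural layers that a trained model would apply between them. The function names, the sample questions, and the 0.5 threshold are hypothetical.

```python
import math
from collections import Counter

def letter_trigrams(sentence):
    """Word hashing: 'good' becomes '#go', 'goo', 'ood', 'od#'."""
    grams = Counter()
    for word in sentence.lower().split():
        marked = f"#{word}#"
        for i in range(len(marked) - 2):
            grams[marked[i:i + 3]] += 1
    return grams

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def match_standard_question(query, standard_questions, threshold=0.5):
    """Return (best_question, score), or (None, score) if below the threshold."""
    q_vec = letter_trigrams(query)
    score, best = max((cosine(q_vec, letter_trigrams(s)), s)
                      for s in standard_questions)
    return (best, score) if score >= threshold else (None, score)

# Hypothetical standard questions for a recruitment call.
standard_questions = [
    "what is the salary",
    "where is the job located",
    "what are the working hours",
]
matched, matched_score = match_standard_question(
    "how much is the salary", standard_questions)
unmatched, unmatched_score = match_standard_question("zzz qqq", standard_questions)
```

In a trained model the sparse trigram vectors would pass through learned layers before the cosine step, so that paraphrases with little character overlap still land close together.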

Slot filling, required for extracting key entities (e.g., car brand, model) from user utterances, is implemented with an IDCNN + CRF model. The pipeline starts from an ontology library, uses Trie‑based keyword spotting, and refines annotations; the model attains a 95% chunk‑level F1 score.
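The Trie‑based keyword‑spotting step that bootstraps annotations from an ontology library might look like the sketch below. It is an assumption‑laden illustration: the ontology contents, the whitespace tokenization, the longest‑match policy, and the BIO tag scheme are all hypothetical choices, and the article does not specify them.

```python
SLOT = object()  # sentinel key that can never collide with a string token

def build_trie(ontology):
    """ontology maps a slot name to a list of surface phrases."""
    root = {}
    for slot, phrases in ontology.items():
        for phrase in phrases:
            node = root
            for token in phrase.lower().split():
                node = node.setdefault(token, {})
            node[SLOT] = slot              # marks the end of a complete phrase
    return root

def bio_tags(tokens, trie):
    """Label tokens with B-/I-/O tags using longest-match lookup in the trie."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        node, end, slot = trie, None, None
        for j in range(i, len(tokens)):
            node = node.get(tokens[j].lower())
            if node is None:
                break
            if SLOT in node:
                end, slot = j, node[SLOT]  # remember the longest match so far
        if slot is None:
            i += 1
            continue
        tags[i] = f"B-{slot}"
        for k in range(i + 1, end + 1):
            tags[k] = f"I-{slot}"
        i = end + 1                        # resume after the matched phrase
    return tags

# Hypothetical ontology entries for the car brand and model slots.
ontology = {
    "brand": ["Toyota", "Land Rover"],
    "model": ["Camry", "Range Rover Sport"],
}
trie = build_trie(ontology)
tokens = "I want a Land Rover or a Toyota Camry".split()
tags = bio_tags(tokens, trie)
```

Tags produced this way are only as good as the ontology, which is why the pipeline refines the annotations before training the sequence model on them.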

The combined NLU components enable a robust, configurable dialog system for voice‑based recruitment calls, and future work will incorporate richer acoustic features to further improve performance in noisy, conversational environments.

AI, intent recognition, TextCNN, semantic similarity, NLU, slot filling, Voice Bot
Written by

58 Tech
