Design and Implementation of Intent Recognition, Semantic Similarity Matching, and Slot Filling for a Voice Robot

This article details the architecture and algorithms behind a voice robot's natural language understanding module, covering single‑sentence intent classification with TextCNN, acoustic quality detection using VGGish‑BiLSTM, semantic similarity matching via DSSM and TextCNN‑Transformer, and slot‑filling with IDCNN‑CRF, along with performance results and future directions.

58 Tech

Oct 16, 2019

The voice robot, developed by 58 Tongcheng AI Lab, must understand user speech in real time to drive multi-turn dialogues. The natural language understanding (NLU) module labels each user utterance with a semantic intent, and that label is the input to the dialog management system.

For single-sentence intent detection, the team defined 19 intent labels grouped into mainline, generic, and reject intents. After experimenting with BiLSTM, FastText, and BERT, they selected a TextCNN text-classification model. It achieves 75% overall accuracy, which rises to 97% when errors caused by ASR (automatic speech recognition) are excluded.
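
The TextCNN forward pass can be sketched without any dependencies. Token embeddings pass through parallel 1-D convolutions of several widths, then ReLU, then max-over-time pooling, and finally a softmax over the 19 intent labels. This is a minimal sketch, not 58 Tech's implementation. The weights are random and untrained. The vocabulary, embedding size, kernel widths (2, 3, 4), and filter count are illustrative assumptions. A real model would be trained with cross-entropy in a deep-learning framework.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TextCNN:
    """Untrained TextCNN forward pass: embed -> 1-D convs -> max-over-time pooling -> linear -> softmax."""

    def __init__(self, vocab, num_classes, emb_dim=8, kernel_sizes=(2, 3, 4), num_filters=4, seed=0):
        rng = random.Random(seed)
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.unk, self.pad = len(vocab), len(vocab) + 1
        self.kernel_sizes = kernel_sizes
        # one random row per vocabulary entry plus <unk>; the <pad> row is all zeros
        self.emb = [[rng.gauss(0, 0.5) for _ in range(emb_dim)] for _ in range(len(vocab) + 1)]
        self.emb.append([0.0] * emb_dim)
        # for each kernel width k: num_filters filters of shape (k, emb_dim), each with a bias
        self.filters = {
            k: [([[rng.gauss(0, 0.5) for _ in range(emb_dim)] for _ in range(k)], 0.0)
                for _ in range(num_filters)]
            for k in kernel_sizes
        }
        feat_dim = num_filters * len(kernel_sizes)
        self.W = [[rng.gauss(0, 0.5) for _ in range(feat_dim)] for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    def forward(self, tokens):
        ids = [self.vocab.get(t, self.unk) for t in tokens]
        ids += [self.pad] * (max(self.kernel_sizes) - len(ids))  # pad short inputs so every kernel fits
        x = [self.emb[i] for i in ids]
        features = []
        for k in self.kernel_sizes:
            for weights, bias in self.filters[k]:
                # slide the filter over every window of k tokens; max(0, ...) = ReLU + max pooling
                best = 0.0
                for start in range(len(x) - k + 1):
                    s = bias + sum(
                        w * v for row in range(k) for w, v in zip(weights[row], x[start + row])
                    )
                    best = max(best, s)
                features.append(best)
        logits = [sum(w * f for w, f in zip(row, features)) + b for row, b in zip(self.W, self.b)]
        return softmax(logits)


# Usage with a toy vocabulary (hypothetical); the prediction is arbitrary because weights are untrained.
vocab = ["yes", "no", "not", "interested", "busy", "call", "later", "ok"]
model = TextCNN(vocab, num_classes=19)
probs = model.forward(["not", "interested"])
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Max-over-time pooling produces a fixed-size feature vector regardless of utterance length, which keeps the model small and fast on short transcripts.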

To mitigate the impact of poor audio quality, a VGGish + BiLSTM acoustic model classifies each recording by sound type (clear speech, “deafness”, noise, etc.). This improves intent accuracy on unclear recordings, and the model reaches 86% accuracy on “deafness”, the most challenging category.
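
The acoustic classifier can be sketched in the same way. VGGish maps audio to one 128-dimensional embedding per roughly 0.96 s segment. A bidirectional LSTM reads that sequence forward and backward, and a softmax over sound types is computed from its final states. This is a minimal pure-Python sketch under several assumptions. The VGGish embeddings are taken as already computed, and the weights are random. The hidden size is arbitrary, and pooling is done by concatenating the two final hidden states. Only the three sound types named above are used as labels.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class LSTMCell:
    def __init__(self, in_dim, hid_dim, rng):
        # one weight matrix per gate (input, forget, candidate, output) applied to [x; h]
        self.hid = hid_dim
        self.W = {gate: [[rng.gauss(0, 0.3) for _ in range(in_dim + hid_dim)] for _ in range(hid_dim)]
                  for gate in "ifgo"}
        self.b = {gate: [0.0] * hid_dim for gate in "ifgo"}

    def step(self, x, h, c):
        xh = x + h  # concatenate input and previous hidden state
        pre = {gate: [sum(w * v for w, v in zip(row, xh)) + b for row, b in zip(self.W[gate], self.b[gate])]
               for gate in "ifgo"}
        i = [sigmoid(z) for z in pre["i"]]
        f = [sigmoid(z) for z in pre["f"]]
        g = [math.tanh(z) for z in pre["g"]]
        o = [sigmoid(z) for z in pre["o"]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

    def run(self, frames):
        h, c = [0.0] * self.hid, [0.0] * self.hid
        for x in frames:
            h, c = self.step(x, h, c)
        return h  # final hidden state summarizes the sequence read in this direction


class BiLSTMSoundClassifier:
    # Subset of sound types named in the article; the real label set is larger.
    LABELS = ["clear_speech", "deafness", "noise"]

    def __init__(self, emb_dim=128, hid_dim=16, seed=0):
        rng = random.Random(seed)
        self.fwd = LSTMCell(emb_dim, hid_dim, rng)
        self.bwd = LSTMCell(emb_dim, hid_dim, rng)
        self.W = [[rng.gauss(0, 0.3) for _ in range(2 * hid_dim)] for _ in self.LABELS]

    def predict_proba(self, frames):
        # concatenate the last forward state with the last backward state (backward reads frames reversed)
        feat = self.fwd.run(frames) + self.bwd.run(frames[::-1])
        logits = [sum(w * v for w, v in zip(row, feat)) for row in self.W]
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return {label: e / total for label, e in zip(self.LABELS, exps)}


# Usage: random vectors stand in for about 10 s of precomputed VGGish embeddings.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(128)] for _ in range(10)]
clf = BiLSTMSoundClassifier()
probs = clf.predict_proba(frames)
```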

Semantic similarity matching handles user queries that do not map directly to an intent label by matching them against standard questions and keyword-based QA entries. The first approach used DSSM to convert sentences into semantic vectors. Later, the team adopted a TextCNN + Transformer architecture, which reached 83.39% accuracy and 80.1% recall.
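
A DSSM-style matcher breaks a sentence into character trigrams and projects the sparse trigram counts through a nonlinear layer into a dense semantic vector. It then ranks candidate questions by cosine similarity. The sketch below makes three assumptions: a single untrained tanh layer, CRC32 bucketing of trigrams, and a hypothetical acceptance threshold of 0.5. A trained DSSM would learn the projection from labeled query/question pairs. With the random weights used here, scores only reflect surface trigram overlap.

```python
import math
import random
import zlib


def char_ngrams(text, n=3):
    padded = f"#{text}#"  # boundary markers, as in DSSM letter-trigram word hashing
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class DSSMEncoder:
    def __init__(self, n_buckets=512, sem_dim=32, seed=0):
        rng = random.Random(seed)
        self.n_buckets = n_buckets
        self.W = [[rng.gauss(0, 0.1) for _ in range(n_buckets)] for _ in range(sem_dim)]

    def encode(self, text):
        # sparse bag of hashed trigrams -> dense semantic vector through one tanh layer
        counts = {}
        for gram in char_ngrams(text.lower()):
            idx = zlib.crc32(gram.encode("utf-8")) % self.n_buckets
            counts[idx] = counts.get(idx, 0) + 1
        return [math.tanh(sum(row[i] * c for i, c in counts.items())) for row in self.W]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_match(encoder, query, standard_questions, threshold=0.5):
    """Return (question, score) for the closest standard question, or (None, score) below threshold."""
    q = encoder.encode(query)
    scored = [(cosine(q, encoder.encode(s)), s) for s in standard_questions]
    score, question = max(scored)
    return (question, score) if score >= threshold else (None, score)


# Usage with hypothetical standard questions.
enc = DSSMEncoder()
questions = ["what is the salary", "where is the office", "do you provide housing"]
match, score = best_match(enc, "where is the office", questions)
```

In production the standard-question vectors would normally be encoded once and cached rather than re-encoded for every query.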

Slot filling extracts key entities (e.g., car brand and model) from user utterances and is implemented with an IDCNN (iterated dilated CNN) + CRF model. The annotation pipeline starts from an ontology library, pre-annotates utterances with Trie-based keyword spotting, and then refines those annotations. The resulting model attains a 95% chunk-level F1 score.
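
The Trie-based keyword spotting step can be sketched as follows. Ontology entries are inserted into a character Trie, and a greedy longest-match scan finds slot values in an utterance. The matches are then converted to character-level BIO tags that could seed the annotations before refinement. The ontology contents and the `brand`/`model` slot names are hypothetical examples. The IDCNN + CRF model itself is not shown.

```python
class TrieNode:
    __slots__ = ("children", "slot")

    def __init__(self):
        self.children = {}
        self.slot = None  # slot type if an ontology entry ends at this node


class KeywordTrie:
    def __init__(self, ontology):
        # ontology: {slot_type: [surface forms]}
        self.root = TrieNode()
        for slot, words in ontology.items():
            for word in words:
                node = self.root
                for ch in word:
                    node = node.children.setdefault(ch, TrieNode())
                node.slot = slot

    def spot(self, text):
        """Greedy longest-match scan; returns (start, end, slot) spans with end exclusive."""
        spans, i = [], 0
        while i < len(text):
            node, j, match = self.root, i, None
            while j < len(text) and text[j] in node.children:
                node = node.children[text[j]]
                j += 1
                if node.slot is not None:
                    match = (i, j, node.slot)  # remember the longest entry seen so far
            if match:
                spans.append(match)
                i = match[1]  # continue after the matched entry
            else:
                i += 1
        return spans


def to_bio(text, spans):
    tags = ["O"] * len(text)
    for start, end, slot in spans:
        tags[start] = f"B-{slot}"
        for k in range(start + 1, end):
            tags[k] = f"I-{slot}"
    return tags


# Usage with a hypothetical ontology; the utterance means "I'd like to see a BMW X5".
trie = KeywordTrie({"brand": ["宝马", "奔驰"], "model": ["X5", "X3", "C级"]})
text = "我想看宝马X5"
spans = trie.spot(text)
tags = to_bio(text, spans)
```

Character-level tags suit Chinese text, which has no whitespace word boundaries. Spans found this way are only candidates, which is why the annotations are refined before they are used to train the sequence model.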

The combined NLU components enable a robust, configurable dialog system for voice‑based recruitment calls, and future work will incorporate richer acoustic features to further improve performance in noisy, conversational environments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
