
qa_match V1.3: Lightweight Deep Learning QA Matching Tool with Semi‑Automatic Knowledge‑Base Mining and Transformer‑Enhanced Pre‑training

The qa_match open‑source tool from 58 Tongcheng, now at version 1.3, introduces semi‑automatic knowledge‑base mining for cold‑start and online scenarios and upgrades its Simple Pre‑trained Model (SPTM) with Transformer‑based feature representation to improve question‑answer matching performance.

58 Tech

qa_match is a lightweight, deep-learning-based question-answer matching tool released by 58 Tongcheng. Version 1.0 launched on March 9, 2020, v1.1 followed in June 2020, and the latest release, v1.3, arrived in December 2020. The project is open source on GitHub (https://github.com/wuba/qa_match) under the Apache License 2.0.

Version 1.3 adds two major features: (1) a semi‑automatic knowledge‑base mining workflow that supports both cold‑start knowledge acquisition and post‑deployment question expansion, and (2) an enhanced Simple Pre‑trained Model (SPTM) that incorporates Transformer‑based feature representations.

The upgrade addresses limitations of earlier releases, which relied on a single‑layer knowledge base and a Bi‑LSTM pre‑training model. By introducing knowledge‑base mining and a Transformer‑augmented SPTM, the system achieves better downstream QA performance.

The semi-automatic knowledge-base mining module builds on the existing qa_match pipeline, using the Deep Embedding Clustering (DEC) algorithm over SPTM embeddings to discover standard questions and their expanded variants. In cold-start scenarios it creates an initial knowledge base from unlabeled data via DEC; after deployment, it continuously mines new utterances using custom cluster centers.

DEC, originally presented at ICML 2016 [3], jointly learns feature representations and cluster assignments. The qa_match implementation replaces the original auto-encoder with SPTM embeddings and allows custom cluster centers to inject supervised signal, reducing the randomness of clustering.
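The core of DEC is a soft cluster assignment under a Student's t-distribution and a sharpened "target" distribution that training pulls the assignments toward by minimizing KL(P || Q). The following NumPy sketch illustrates these two steps as defined in the DEC paper; the function names are ours, and in qa_match the embeddings would come from SPTM rather than an auto-encoder:

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Soft assignment q_ij between embedding z_i and cluster center mu_j
    using a Student's t-distribution kernel (DEC, Xie et al. 2016)."""
    # squared Euclidean distances, shape (n_points, n_clusters)
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Auxiliary distribution p_ij that squares and renormalizes q,
    emphasizing high-confidence assignments; DEC trains by
    minimizing KL(P || Q)."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)
```

Because P is derived from Q itself, alternating between recomputing P and taking gradient steps on KL(P || Q) gradually sharpens the clusters without any labels.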

Cold-start mining scenario: when a new business adopts automatic QA, historical unlabeled data is clustered with DEC to generate standard questions and their expanded utterances.

[Figure: cold-start knowledge-base mining workflow]

Post-deployment mining scenario: once the QA matching model is online, new user queries are clustered with DEC using custom cluster centers to enrich the existing knowledge base, expanding coverage and improving precision and recall.

[Figure: post-deployment knowledge-base mining workflow]
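One way to picture the custom-center idea: embeddings of the existing standard questions act as fixed anchors, new queries close to an anchor become candidate expanded questions for that entry, and queries far from every anchor are surfaced for review as potential new standard questions. This is an illustrative sketch under our own assumptions (function name and distance threshold `tau` are not from qa_match):

```python
import numpy as np

def mine_with_custom_centers(query_emb, kb_centers, tau=1.0):
    """Route new-query embeddings against fixed knowledge-base centers.

    query_emb : (n_queries, dim) SPTM-style embeddings of new user queries
    kb_centers: (n_entries, dim) embeddings of existing standard questions
    tau       : distance threshold separating "expanded" from "novel"
    Returns (nearest_center, expanded_idx, novel_idx).
    """
    # pairwise Euclidean distances, shape (n_queries, n_entries)
    d = np.linalg.norm(query_emb[:, None, :] - kb_centers[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)           # closest existing entry per query
    min_d = d.min(axis=1)
    expanded = np.where(min_d <= tau)[0]  # enrich existing entries
    novel = np.where(min_d > tau)[0]      # candidates for new entries
    return nearest, expanded, novel
```

In the real pipeline the "novel" queries would themselves be clustered by DEC to propose new standard questions, with a human confirming the results (hence "semi-automatic").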

Evaluation of the clustering algorithm uses both external and internal metrics; internal quality is measured by the silhouette coefficient. Results are shown in the table below:

| Dataset size | Model | Silhouette coefficient | Runtime | Inference time |
| --- | --- | --- | --- | --- |
| 10K (1w) | DEC | 0.7962 | 30 min | 52 s |
| 100K (10w) | DEC | 0.9302 | 3 h 5 min | 5 min 55 s |
| 1M (100w) | DEC | 0.8490 | 11 h 30 min | 15 min 28 s |
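The silhouette coefficient reported above scores each point by comparing its mean intra-cluster distance a against the smallest mean distance b to any other cluster, s = (b - a) / max(a, b), and averages over all points. A minimal NumPy implementation of that definition (ours, not qa_match's evaluation code; assumes every cluster has at least two points):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient in [-1, 1]; higher means tighter,
    better-separated clusters. Assumes no singleton clusters."""
    n = len(X)
    # full pairwise Euclidean distance matrix, shape (n, n)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    idx = np.arange(n)
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        a = D[i, same & (idx != i)].mean()          # mean intra-cluster distance
        b = min(D[i, labels == c].mean()            # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```

For the table sizes above (up to 1M utterances) a library routine such as scikit-learn's `silhouette_score`, or a sampled variant, would be the practical choice; this quadratic version only illustrates the metric.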

The upgraded SPTM incorporates a shared‑parameter Transformer encoder. Input consists of Word‑Aware token embeddings and Position‑Aware embeddings. The shared Transformer encoder uses multi‑head attention and feed‑forward layers with residual connections, enhancing representation capacity while keeping parameter count low.
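"Shared-parameter" here means one set of attention and feed-forward weights is reused at every encoder layer (as in ALBERT), so stacking layers adds depth without adding parameters. The single-head, NumPy-only sketch below shows the structure described above - attention and feed-forward sublayers, each wrapped in a residual connection, applied repeatedly with the same weights; it is a simplified illustration, not the SPTM code:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v

def shared_encoder(x, params, n_layers=4):
    """Apply the SAME attention/FFN parameters at every layer,
    keeping the parameter count that of a single layer."""
    Wq, Wk, Wv, W1, b1, W2, b2 = params
    for _ in range(n_layers):
        x = layer_norm(x + attention(x, Wq, Wk, Wv))  # residual + attention
        h = np.maximum(0.0, x @ W1 + b1)              # feed-forward, ReLU
        x = layer_norm(x + h @ W2 + b2)               # residual + FFN
    return x
```

The input `x` would be the sum of the Word-Aware token embeddings and Position-Aware embeddings mentioned above; the real SPTM additionally uses multi-head attention.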

[Figure: pre-training time for the Transformer-based SPTM]

Future plans include a TensorFlow 2.4 compatible release and a PyTorch implementation of qa_match.

Contributions are welcome via GitHub pull requests or issues (https://github.com/wuba/qa_match.git) or by emailing ailab‑[email protected].

Authors: Lv Yuan-yuan, Wang Yong, and He Rui are senior algorithm engineers and architects at the 58 Tongcheng AI Lab, responsible for intelligent-QA research and development.

References:
[1] Single-layer knowledge-base QA: https://github.com/wuba/qa_match#基于一层结构知识库的自动问答
[2] SPTM-based QA: https://github.com/wuba/qa_match/tree/v1.1#基于sptm模型的自动问答
[3] Xie, Junyuan, Ross Girshick, and Ali Farhadi. "Unsupervised Deep Embedding for Clustering Analysis." ICML 2016.
