
Improving Text Representation and Clustering for Small‑Sample Scenarios in 58 Second‑Hand Car Intelligent Customer Service

This article presents a study on enhancing text representation and clustering in a small‑sample setting for 58's second‑hand car intelligent customer service by introducing a Bi‑LSTM based pre‑training language model and an improved Deep Embedded Clustering (DEC) algorithm, demonstrating significant gains in accuracy, silhouette score, and answer‑rate through extensive experiments.

DataFunTalk

Background 58.com’s intelligent customer service system ("BangBang") provides automated Q&A, online human chat, and smart assistance across the company's business lines. In the second‑hand car domain, the system struggles with weak text representation and low clustering purity because only a small amount of labeled data is available.

Problem Statement The key issues are (1) insufficient representation of diverse user queries in a small‑sample scenario, leading to poor model generalization, and (2) difficulty discovering new user questions to improve coverage.

Proposed Solutions Two algorithms are explored: (1) a Bi‑LSTM based pre‑training language model that adapts BERT’s masked‑LM task to the vertical domain and replaces the Transformer encoder with a Bi‑LSTM to reduce computational cost; (2) the Deep Embedded Clustering (DEC) algorithm, which jointly learns feature representations and cluster assignments.

Bi‑LSTM Pre‑training Model The model is trained on 40 million unlabeled second‑hand car sentences using only the masked‑LM task, adds residual connections and layer‑norm between Bi‑LSTM layers, and is trained on a single NVIDIA TESLA P40 GPU for 30 k iterations (~28 h). Experiments on a 26 k‑sample classification task show accuracy improvement from 0.81 to 0.86, outperforming a BERT‑based baseline (0.8487).

| Model | Accuracy (with pre‑training) | Accuracy (no pre‑training) |
|---|---|---|
| Bi‑LSTM | 0.8662 | 0.8107 |
| BERT | 0.8487 (5 epochs) / 0.8530 (10 epochs) | 0.7884 (5 epochs) / 0.8342 (10 epochs) |
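The article does not include implementation code; the following is a minimal Python sketch of the BERT‑style corruption step that the Bi‑LSTM model borrows for its masked‑LM objective. The `mask_tokens` function and toy `VOCAB` are illustrative, not taken from the original system.

```python
import random

MASK = "[MASK]"
VOCAB = ["price", "car", "fee", "loan", "transfer", "insurance"]  # toy vocab

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masked-LM corruption: select ~15% of positions as
    prediction targets; of those, 80% become [MASK], 10% are replaced
    with a random vocabulary token, and 10% are left unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        targets[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)
        # else: keep the original token (but it is still a target)
    return corrupted, targets
```

During pre‑training, cross‑entropy loss is computed only at the positions recorded in `targets`, so the encoder must reconstruct masked tokens from bidirectional context.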

DEC Algorithm Description DEC consists of two stages: (1) pre‑training an auto‑encoder to obtain initial features, (2) fine‑tuning the encoder together with cluster centroids using a KL‑divergence loss between a soft assignment distribution q and a target distribution p. The original K‑means initialization is replaced with custom centroids derived from the average vectors of existing standard questions, reducing randomness.
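As a concrete reference for the fine‑tuning stage, the two DEC distributions and the KL objective can be sketched in NumPy as follows (function names are illustrative; in DEC these operate on the auto‑encoder's embedding space):

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Student's-t kernel soft assignment q_ij between embedded points
    z (n, d) and cluster centroids (k, d), as used in DEC."""
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p_ij proportional to q_ij^2 / f_j, where
    f_j = sum_i q_ij is the soft cluster frequency; emphasizes
    high-confidence assignments."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over all points: the fine-tuning objective
    minimized jointly over encoder weights and centroids."""
    return float((p * np.log((p + eps) / (q + eps))).sum())
```

Each fine‑tuning step recomputes `q` from the current embeddings, periodically refreshes `p`, and back‑propagates `kl_loss` through both the encoder and the centroids.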

Experimental Comparison Three experiments were conducted on a small labeled dataset: (1) K‑means + Word2Vec static features, (2) K‑means + Bi‑LSTM static features, (3) DEC + Bi‑LSTM pre‑trained features. Results (Table 2) show that DEC with Bi‑LSTM achieves the highest accuracy (0.8437) and the highest silhouette score (0.142), albeit with the longest runtime (30 min).

Table 2. Clustering comparison.

| Method | Accuracy | Silhouette | Runtime |
|---|---|---|---|
| K‑means + Word2Vec | 0.354 | 0.047 | <5 min |
| K‑means + Bi‑LSTM | 0.377 | 0.025 | <5 min |
| DEC + Bi‑LSTM | 0.8437 | 0.142 | 30 min |

Impact on Online System Applying the improved DEC to the online Q&A robot uncovered new standard questions (e.g., “What do the ‘other fees’ include?”) and increased the weekly average answer‑rate from 79.71 % to 83.62 %.

Iterative Improvement of Question Expansion After deployment, analysis of bad cases revealed many variant utterances for existing standard questions. By customizing DEC centroids with averaged vectors of all known expansion queries, the system’s precision rose from 98.11 % to 98.24 % and recall from 89.66 % to 92.27 %.

| Metric | Before Iteration | After Iteration |
|---|---|---|
| Precision | 98.11 % | 98.24 % |
| Recall | 89.66 % | 92.27 % |
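The centroid customization described above, which replaces random K‑means initialization with the averaged vectors of known expansion queries, can be sketched as follows (a minimal NumPy illustration; `centroids_from_expansions` is a hypothetical name, and the input vectors are assumed to come from the Bi‑LSTM encoder):

```python
import numpy as np

def centroids_from_expansions(vectors, labels):
    """Build one DEC centroid per standard question as the mean vector
    of all its known expansion (variant) queries, removing the
    randomness of K-means initialization."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    centroids = np.stack(
        [vectors[labels == c].mean(axis=0) for c in classes])
    return classes, centroids
```

The resulting `centroids` array seeds DEC's fine‑tuning stage directly, so each cluster starts anchored at an existing standard question rather than at an arbitrary K‑means solution.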

Conclusion and Future Work The study demonstrates that (1) a domain‑specific Bi‑LSTM pre‑training model markedly improves text representation for small‑sample NLP tasks, and (2) an enhanced DEC algorithm with custom centroids boosts clustering purity and downstream Q&A performance. Future directions include leveraging transfer learning between online/offline data, designing more suitable unsupervised objectives, and incorporating self‑supervision.

Tags: AI, NLP, text representation, Bi-LSTM, DEC, deep clustering, small sample
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
