Large-Scale Short Text Clustering System Design and Practice at Baidu Search
At Baidu Search, a large-scale short-text clustering system was built around multi-level semantic splitting, fine-grained aggregation, and error correction. Evolving from v1.0 to v2.0, it now clusters 100 million queries within three days at 95% accuracy and 80% recall.
Background: Search is a scenario where users explicitly express their needs. Baidu Search handles massive request volumes daily, with the same intent expressed in many different ways. The core purpose of a large-scale short text clustering system is to summarize and generalize short texts, represented by queries, efficiently and precisely clustering short texts with the same meaning but different expressions into semantically cohesive, clearly expressed "demands". This not only compresses the volume of short texts but also better captures user needs and helps content providers deliver better content. The system already supports content production for Baidu's UGC products.
Short Text Clustering Problem: Clustering is a common unsupervised algorithm that divides a dataset into different classes or clusters based on a distance metric, maximizing the similarity of data objects within the same cluster while maximizing the differences between data objects in different clusters. For search queries primarily consisting of short texts, the goal is to aggregate all texts with consistent meanings into a "demand cluster", which is the short text clustering problem.
Common Algorithms:
SimHash: A locality-sensitive hashing algorithm commonly used in text clustering and web page deduplication. It maps similar documents to nearby hash values with high probability. However, because short texts offer far fewer features than full documents, SimHash's effectiveness on short text clustering is greatly reduced.
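To make the limitation concrete, here is a minimal SimHash sketch (not Baidu's implementation): each token votes on every bit of a fingerprint, and similar token sets yield fingerprints with small Hamming distance. With only a handful of tokens, changing one token flips a large fraction of the votes, which is why the technique degrades on short text.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    votes = [0] * bits
    for token in text.split():
        # Hash each token to a stable integer.
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Collapse the vote vector back into a bit fingerprint.
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

For long documents sharing most of their tokens, the distance stays small; for a three-word query, swapping a single word perturbs a third of the votes.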
Vectorization Clustering: First vectorizes the text, then applies a conventional clustering method. Vectorization options include tf-idf, word2vec, and pre-trained models such as BERT and ERNIE; clustering options include k-means, hierarchical clustering, and single-pass. Issues include: 1) k-means requires the number of clusters as a hyperparameter, which must be fixed in advance; 2) for short text clustering, the number of clusters is often very large, causing severe performance degradation; 3) accuracy depends on both the vectorization and the clustering algorithm, so errors compound.
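A minimal single-pass sketch, using plain bag-of-words vectors and cosine similarity in place of the heavier representations mentioned above (the threshold value is illustrative): each text joins the best-matching existing cluster or starts a new one, so no cluster count is needed up front.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(texts, threshold=0.5):
    """Assign each text to the most similar cluster centroid above the
    threshold, otherwise open a new cluster."""
    clusters = []  # list of (centroid Counter, member list) pairs
    for text in texts:
        vec = Counter(text.split())
        best, best_sim = None, threshold
        for centroid, members in clusters:
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = (centroid, members), sim
        if best is None:
            clusters.append((vec, [text]))
        else:
            centroid, members = best
            centroid.update(vec)  # fold the new text into the centroid
            members.append(text)
    return [members for _, members in clusters]
```

Single-pass avoids fixing the cluster count, but its result depends on input order and it still scans every centroid per text, which motivates the bucketing strategy described later.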
Challenges of Large-Scale Short Text Clustering:
High accuracy requirements: Clustering algorithms are unsupervised and are typically judged only by distance metrics in vector space. In the search query scenario, however, clustering has clear evaluation criteria: with a unified text similarity evaluation standard, clustering accuracy can be assessed post hoc, and the bar is high.
Large data scale: Search query scale is enormous; efficient computation for hundreds of millions of data points tests algorithm design and engineering capabilities.
High precision and low latency for text similarity: Text similarity is a scenario-dependent problem. In search scenarios, precision requirements are very high; often a single character difference represents completely different needs.
Complex text representation: Text representation methods include weighted hash functions in SimHash, word2vec vectors, and category/keyword information.
Error detection and correction: Each step from text representation to text similarity to clustering accumulates errors.
Overall Approach: Split the problem level by level, attacking each challenge individually:
Multi-level splitting: First split large-scale short texts into multiple levels, ensuring queries with the same meaning enter the same bucket with high probability. Level-1 splitting ensures semantic mutual exclusion; Level-2 splitting ensures manageable computation scale.
Fine-grained semantic aggregation: For queries in the same bucket, perform fine-grained semantic aggregation, merging queries with the same meaning into one cluster.
Error correction: Perform error checking on semantic clusters within the same level-1 bucket, merging clusters that need merging.
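The three steps above can be sketched as a pipeline: a coarse level-1 key routes queries into mutually exclusive buckets, and fine-grained aggregation runs independently inside each bucket. The key function and the `same_meaning` predicate here are hypothetical stand-ins; the real system uses semantic signatures and a trained similarity model, and adds a second splitting level plus cross-cluster error correction.

```python
from collections import defaultdict

def level1_key(query: str) -> str:
    # Hypothetical coarse key: the sorted set of tokens. A production system
    # would use semantic signals (categories, keywords) to build this key.
    return " ".join(sorted(set(query.split())))

def cluster_bucket(queries, same_meaning):
    """Fine-grained aggregation inside one bucket: greedily merge each query
    into the first cluster whose representative it matches."""
    clusters = []
    for q in queries:
        for cluster in clusters:
            if same_meaning(q, cluster[0]):  # compare to the representative
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters

def pipeline(queries, same_meaning):
    # Level-1 splitting: same-meaning queries should land in the same bucket.
    buckets = defaultdict(list)
    for q in queries:
        buckets[level1_key(q)].append(q)
    # Buckets are independent, so this loop parallelizes trivially.
    results = []
    for bucket in buckets.values():
        results.extend(cluster_bucket(bucket, same_meaning))
    return results
```

Because pairwise comparison only happens within a bucket, the quadratic cost of aggregation is bounded by bucket size rather than total corpus size, which is what makes the hundred-million-query scale tractable.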
Baidu Geek Talk