Tag Mining and Optimization Practices Using Chinese Segmentation Tools
This article surveys practical tag mining methods (similarity‑based, compound‑word, topic, hot‑search, and image‑based), evaluates four open‑source Chinese segmentation tools, and describes tag optimization through synonym and negative‑word detection. It covers both model‑based and statistical‑rule‑based techniques and aims to offer guidance for improving tag quality.
Background : Tags are concise feature words or phrases extracted from post content to structure and highlight posts, enhancing user experience. The team evaluated several Chinese word‑segmentation tools and selected HanLP as the primary tokenizer for tag mining.
Chinese Segmentation : Chinese segmentation splits character sequences into words, a necessary step because Chinese lacks explicit delimiters. Four open‑source tools—HanLP, IK, Ansj, and Jieba—were compared using the SIGHAN and People’s Daily 2014 datasets, with metrics Precision, Recall, F‑Measure, and ErrorRate.
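Table 1: Segmentation results on the SIGHAN dataset

| Segmentation system | Precision | Recall | F-Measure | ErrorRate |
| HanLP | 0.803 | 0.803 | 0.803 | 0.197 |
| IK | 0.739 | 0.741 | 0.740 | 0.262 |
| Ansj (CRF) | 0.927 | 0.929 | 0.928 | 0.074 |
| Jieba | 0.803 | 0.804 | 0.803 | 0.198 |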
Table 1 shows that Ansj (in CRF mode) performs best overall, HanLP and Jieba are nearly identical, and IK performs worst. On the People’s Daily 2014 dataset (Table 2), HanLP achieves the highest scores.
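Segmentation scores of this kind are usually computed by comparing the character spans of predicted words against those of a gold-standard segmentation. The following is a minimal sketch under that assumption, not the evaluation script used in the article. The function names and the example sentence are hypothetical, and ErrorRate is left out because the article does not state how it is defined.

```python
def to_spans(words):
    """Map a word sequence to the set of (start, end) character offsets it covers."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def evaluate(gold_words, pred_words):
    """Return (precision, recall, f_measure) for one segmented sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # a word counts only if both boundaries match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: the tokenizer merged two gold words into one.
gold = ["空调", "维修", "服务"]
pred = ["空调", "维修服务"]
precision, recall, f_measure = evaluate(gold, pred)
print(precision, recall, f_measure)
```

Here only one of the two predicted words matches a gold word, so precision is 1/2 and recall is 1/3. A corpus-level score would sum the correct, predicted, and gold counts over all sentences before dividing.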
Tag Mining Evolution : After selecting HanLP, the team applied several mining strategies on Yellow Page post logs, including similarity‑based mining, compound‑word mining, topic‑word mining (LDA), hot‑search mining, and image‑based mining.
1. Similarity Mining : Using seed tags provided by product teams, the pipeline reads recent logs, filters posts, tokenizes with HanLP, removes certain POS tags, extracts TF‑IDF keywords, trains word‑vector models, and selects words similar to seeds as candidate tags.
2. Compound‑Word Mining : New words are formed by combining minimal tokens from the segmentation output. Statistical rules compute cohesion, freedom, and frequency to decide if a combination should be treated as a word.
3. Topic‑Word Mining : Posts are pre‑processed and an LDA model is trained to extract topic words per category. The results were unsatisfactory, so this approach is not recommended.
4. Hot‑Search Mining : Top‑N search terms from user logs are cleaned and normalized, then used as candidate tags.
5. Image Tag Mining : Top‑N posts are sent to a third‑party image‑recognition service; resulting image tags that meet rules become candidates.
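The keyword-extraction and seed-matching parts of similarity mining (step 1) can be sketched as follows. This is an illustration under stated assumptions, not the team's pipeline: posts are assumed to be already tokenized and POS-filtered upstream, the two-dimensional vectors are hand-made toy values standing in for a trained word-vector model, and the function names and the 0.9 threshold are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """Return the top_k TF-IDF terms for each tokenized post."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        term_freq = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score; break ties by code point for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:top_k])
    return keywords


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_to_seeds(words, seeds, vectors, threshold=0.9):
    """Keep words whose vector is close to at least one seed tag."""
    candidates = []
    for word in words:
        if word in seeds or word in candidates or word not in vectors:
            continue
        best = max(
            (cosine(vectors[word], vectors[s]) for s in seeds if s in vectors),
            default=0.0,
        )
        if best >= threshold:
            candidates.append(word)
    return candidates


# Toy, hand-made inputs (assumptions for illustration only).
posts = [
    ["空调", "维修", "上门", "空调"],
    ["冰箱", "维修", "上门"],
    ["搬家", "公司", "上门"],
]
vectors = {
    "维修": [1.0, 0.1],
    "空调": [0.9, 0.3],
    "冰箱": [0.8, 0.4],
    "搬家": [0.1, 1.0],
    "公司": [0.0, 1.0],
}
keywords = [word for doc_kw in tfidf_keywords(posts) for word in doc_kw]
candidates = similar_to_seeds(keywords, {"维修"}, vectors)
print(candidates)
```

With these toy values, the words that sit close to the seed "维修" in vector space survive, while "搬家" and "公司" fall below the threshold. The word "上门" never becomes a keyword because it appears in every post and so has an IDF of zero.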
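For compound-word mining (step 2), the article names cohesion, freedom, and frequency but gives no formulas. One common reading, assumed here, measures cohesion as the pointwise mutual information of two adjacent tokens and freedom as the smaller of the entropies of the tokens seen to the left and to the right of the pair. The sketch below follows that reading; the function names, thresholds, and sample sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a frequency table."""
    total = sum(counter.values())
    if not total:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def compound_stats(token_seqs):
    """Map each adjacent token pair to (frequency, cohesion, freedom)."""
    unigram, bigram = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for seq in token_seqs:
        unigram.update(seq)
        for i in range(len(seq) - 1):
            pair = (seq[i], seq[i + 1])
            bigram[pair] += 1
            if i > 0:
                left[pair][seq[i - 1]] += 1   # token preceding the pair
            if i + 2 < len(seq):
                right[pair][seq[i + 2]] += 1  # token following the pair
    n_uni, n_bi = sum(unigram.values()), sum(bigram.values())
    stats = {}
    for (a, b), freq in bigram.items():
        # Cohesion: PMI of the pair against its parts occurring independently.
        cohesion = math.log(
            (freq / n_bi) / ((unigram[a] / n_uni) * (unigram[b] / n_uni))
        )
        # Freedom: how varied the surrounding context is on the weaker side.
        freedom = min(entropy(left[(a, b)]), entropy(right[(a, b)]))
        stats[(a, b)] = (freq, cohesion, freedom)
    return stats


def select_compounds(stats, min_freq=2, min_cohesion=1.0, min_freedom=1.0):
    """Join pairs that pass all three (hypothetical) thresholds."""
    return sorted(
        a + b
        for (a, b), (freq, cohesion, freedom) in stats.items()
        if freq >= min_freq and cohesion >= min_cohesion and freedom >= min_freedom
    )


# Hypothetical segmentation output for four posts.
token_seqs = [
    ["上门", "空调", "维修", "服务"],
    ["专业", "空调", "维修", "电话"],
    ["附近", "空调", "维修", "师傅"],
    ["空调", "清洗"],
]
stats = compound_stats(token_seqs)
compounds = select_compounds(stats)
print(compounds)
```

In this toy input the pair ("空调", "维修") occurs three times with three different neighbours on each side, so it passes all thresholds and is joined into one candidate word. Every other pair occurs only once and is rejected on frequency alone.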
Tag Optimization : After generating many tags, the team identified issues such as synonym redundancy and negative tags, and introduced optimization steps.
Synonym Mining : Methods include using the Cilin synonym lexicon, character‑order reversal rules (e.g., “空调维修” vs. “维修空调”), edit‑distance thresholds (Levenshtein or Jaro), third‑party synonym services, and word‑embedding similarity (with fastText‑style n‑gram handling for OOV words).
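Two of the rules above, order reversal and an edit-distance threshold, are simple enough to sketch directly. The code below is an illustration under assumptions rather than the team's implementation: the reversal rule is approximated as "same characters, different order", the distance is plain Levenshtein, and the function names and the threshold of 1 are hypothetical.

```python
def is_reordering(a, b):
    """True if two tags use the same characters in a different order."""
    return a != b and sorted(a) == sorted(b)


def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def synonym_candidates(tags, max_distance=1):
    """Return (tag_a, tag_b, rule) triples for pairs matched by either rule."""
    pairs = []
    for i, a in enumerate(tags):
        for b in tags[i + 1:]:
            if is_reordering(a, b):
                pairs.append((a, b, "reordering"))
            elif levenshtein(a, b) <= max_distance:
                pairs.append((a, b, "edit-distance"))
    return pairs


# Hypothetical tag list.
tags = ["空调维修", "维修空调", "空调维修店", "搬家公司"]
pairs = synonym_candidates(tags)
print(pairs)
```

Both rules can over-match, since short tags that differ by one character may mean different things, so the output is best treated as candidate pairs for review rather than confirmed synonyms.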
Negative‑Word Mining : Negation words, POS templates, and frequency analysis are used to detect negative tags, though manual review is required due to higher error rates.
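The negation-word rule can be sketched as a simple substring check. This is a minimal illustration under assumptions: the marker list, the sample tags and counts, and the use of frequency only to order the review queue are all hypothetical, and the POS-template step is omitted because it needs a tagger.

```python
# Hypothetical negation markers; a real list would be curated by hand.
NEGATION_MARKERS = ("不", "无", "非", "未", "没", "勿")


def flag_negative_tags(tag_counts, markers=NEGATION_MARKERS):
    """Return (tag, count) pairs containing a marker, most frequent first."""
    flagged = [
        (tag, count)
        for tag, count in tag_counts.items()
        if any(marker in tag for marker in markers)
    ]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


# Hypothetical tag frequencies.
tag_counts = {"空调维修": 40, "不包邮": 12, "无中介": 7, "无锡搬家": 5}
flagged = flag_negative_tags(tag_counts)
print(flagged)
```

The last flagged tag contains the place name 无锡 (Wuxi) rather than a negation, which is the kind of false positive that makes the manual review step necessary.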
Finally, tag weights are updated daily based on user interaction feedback to reflect dynamic relevance.
Conclusion : The article summarizes the practical tag‑mining workflow, covering model‑based and rule‑based methods, and presents synonym and negative‑word strategies for tag optimization.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.