
Tag Mining and Optimization Practices Using Chinese Segmentation Tools

This article presents a comprehensive overview of tag mining practices—including similarity‑based, compound‑word, topic, hot‑search, and image‑based approaches—along with detailed evaluations of Chinese segmentation tools and systematic tag optimization techniques such as synonym and negative‑word detection.

58 Tech

Feb 26, 2020


The article introduces practical methods for tag mining, covering both model‑based and statistical‑rule‑based techniques, and aims to provide insights and guidance for improving tag quality.

Background : Tags are concise feature words or phrases extracted from post content to structure and highlight posts, enhancing user experience. The team evaluated several Chinese word‑segmentation tools and selected HanLP as the primary tokenizer for tag mining.

Chinese Segmentation : Chinese segmentation splits character sequences into words, a necessary step because Chinese lacks explicit delimiters. Four open‑source tools—HanLP, IK, Ansj, and Jieba—were compared using the SIGHAN and People’s Daily 2014 datasets, with metrics Precision, Recall, F‑Measure, and ErrorRate.

Table 1

Segmenter        Precision  Recall  F-Measure  ErrorRate
HanLP            0.803      -       -          0.197
(label missing)  0.739      0.741   0.740      0.262
Ansj (crf)       0.927      0.929   0.928      0.074
jieba            0.803      0.804   0.803      0.198

From Table 1, Ansj shows the best overall performance, while HanLP and Jieba are comparable and IK performs the worst. Table 2, based on the People’s Daily dataset, indicates that HanLP achieves the highest scores.
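As a rough illustration of how scores like these can be computed, the sketch below compares a predicted segmentation with a gold segmentation by character spans. The function names are hypothetical, and the ErrorRate definition used here (wrongly output words divided by gold words) is an assumption, since the article does not define it.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def segmentation_scores(gold_words, pred_words):
    """Return (precision, recall, f_measure, error_rate) for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    if not gold or not pred:
        raise ValueError("both segmentations must contain at least one word")
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred)
    recall = correct / len(gold)
    f_measure = (2 * precision * recall / (precision + recall)) if correct else 0.0
    # Assumed definition: wrongly output words relative to the gold word count.
    error_rate = (len(pred) - correct) / len(gold)
    return precision, recall, f_measure, error_rate


gold = ["空调", "维修", "服务"]
pred = ["空调", "维", "修", "服务"]
p, r, f, e = segmentation_scores(gold, pred)
```

In this toy case two of the four predicted words match the gold boundaries, so precision is 0.5 and recall is 2/3. A real evaluation would aggregate the counts over the whole corpus before dividing.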

Tag Mining Evolution : After selecting HanLP, the team applied several mining strategies on Yellow Page post logs, including similarity‑based mining, compound‑word mining, topic‑word mining (LDA), hot‑search mining, and image‑based mining.

1. Similarity Mining : Using seed tags provided by product teams, the pipeline reads recent logs, filters posts, tokenizes with HanLP, removes certain POS tags, extracts TF‑IDF keywords, trains word‑vector models, and selects words similar to seeds as candidate tags.
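The keyword extraction and seed-similarity steps can be sketched as follows. This is a minimal illustration, not the team's pipeline: posts are assumed to be already tokenized (for example by HanLP), the `vectors` dictionary stands in for a trained word-vector model, and the function names and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """docs: list of token lists. Returns the top_k TF-IDF tokens per document."""
    df = Counter(tok for doc in docs for tok in set(doc))
    n = len(docs)
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, then by token for a stable order.
        result.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])
    return result


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_to_seeds(keywords, seeds, vectors, threshold=0.8):
    """Keep keywords whose vector is close to at least one seed vector."""
    candidates = set()
    for kw in keywords:
        if kw in seeds or kw not in vectors:
            continue
        if any(cosine(vectors[kw], vectors[s]) >= threshold
               for s in seeds if s in vectors):
            candidates.add(kw)
    return candidates


docs = [["开锁", "换锁", "师傅"], ["搬家", "师傅"]]
keywords = tfidf_keywords(docs, top_k=2)
vectors = {"开锁": [1.0, 0.0], "换锁": [0.9, 0.1], "搬家": [0.0, 1.0]}  # toy vectors
candidates = similar_to_seeds(["换锁", "搬家", "开锁"], {"开锁"}, vectors)
```

Here "换锁" survives as a candidate because its toy vector is close to the seed "开锁", while "搬家" is dropped. In practice the vectors would come from a model trained on the post corpus.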

2. Compound‑Word Mining : New words are formed by combining minimal tokens from the segmentation output. Statistical rules compute cohesion, freedom, and frequency to decide if a combination should be treated as a word.
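One common way to compute these three statistics for adjacent token pairs is sketched below. It assumes cohesion is measured as pointwise mutual information and freedom as the smaller of the left and right neighbor entropies; the article does not give its exact formulas, and the function names and thresholds are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a neighbor-count distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def score_compounds(token_lists):
    """Score every adjacent token pair by frequency, cohesion, and freedom."""
    unigrams, bigrams = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for toks in token_lists:
        unigrams.update(toks)
        for i in range(len(toks) - 1):
            pair = (toks[i], toks[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left[pair][toks[i - 1]] += 1   # token seen before the pair
            if i + 2 < len(toks):
                right[pair][toks[i + 2]] += 1  # token seen after the pair
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (a, b), freq in bigrams.items():
        p_ab = freq / n_bi
        cohesion = math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        freedom = min(entropy(left[(a, b)]), entropy(right[(a, b)]))
        scores[(a, b)] = {"freq": freq, "cohesion": cohesion, "freedom": freedom}
    return scores


def is_compound(score, min_freq=2, min_cohesion=1.0, min_freedom=0.5):
    """Hypothetical thresholds; real values would be tuned on the corpus."""
    return (score["freq"] >= min_freq
            and score["cohesion"] >= min_cohesion
            and score["freedom"] >= min_freedom)


corpus = [["专业", "空调", "维修", "服务"],
          ["上门", "空调", "维修", "电话"],
          ["空调", "维修"]]
scores = score_compounds(corpus)
pair_score = scores[("空调", "维修")]
```

In this toy corpus the pair ("空调", "维修") occurs three times with varied neighbors on both sides, so it passes all three checks, whereas ("专业", "空调") occurs once and never has a left neighbor.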

3. Topic‑Word Mining : Posts are pre‑processed and an LDA model is trained to extract topic words per category. The results were unsatisfactory, so this approach is not recommended.

4. Hot‑Search Mining : Top‑N search terms from user logs are cleaned and normalized, then used as candidate tags.
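A minimal sketch of the cleaning and Top-N step is shown below. The specific normalization rules (NFKC folding, lowercasing, stripping punctuation and whitespace, a minimum length of two characters) are assumptions for illustration; the article does not list the rules it uses.

```python
import re
import unicodedata
from collections import Counter


def normalize_query(query):
    """Fold full-width forms, lowercase, and drop punctuation and whitespace."""
    query = unicodedata.normalize("NFKC", query).lower()
    return re.sub(r"[\W_]+", "", query)


def top_queries(queries, n=10, min_len=2):
    """Return the n most frequent normalized queries with their counts."""
    counts = Counter()
    for q in queries:
        norm = normalize_query(q)
        if len(norm) >= min_len:
            counts[norm] += 1
    return counts.most_common(n)


queries = ["空调维修", " 空调 维修 ", "空调维修！", "２４小时开锁", "24小时开锁", "搬家"]
hot = top_queries(queries, n=2)
```

The three spellings of "空调维修" collapse into one entry, and the full-width "２４小时开锁" merges with its half-width form, so both reach the top two.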

5. Image Tag Mining : Top‑N posts are sent to a third‑party image‑recognition service; resulting image tags that meet rules become candidates.

Tag Optimization : After generating many tags, the team identified issues such as synonym redundancy and negative tags, and introduced optimization steps.

Synonym Mining : Methods include using the Cilin synonym lexicon, word‑order reversal rules (e.g., “空调维修” vs. “维修空调”), string‑similarity thresholds (Levenshtein distance or Jaro similarity), third‑party synonym services, and word‑embedding similarity (with fastText‑style n‑gram handling for OOV words).
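Two of these rules, reordering and edit distance, are simple enough to sketch directly. The code below is an illustration only: tags are assumed to be given as token lists, and the helper names and the 0.25 distance ratio are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def is_reordering(tokens_a, tokens_b):
    """True when both tags contain the same tokens in a different order."""
    return tokens_a != tokens_b and sorted(tokens_a) == sorted(tokens_b)


def synonym_candidate(tokens_a, tokens_b, max_ratio=0.25):
    """Flag a pair as a synonym candidate by reordering or small edit distance."""
    if is_reordering(tokens_a, tokens_b):
        return True
    tag_a, tag_b = "".join(tokens_a), "".join(tokens_b)
    longest = max(len(tag_a), len(tag_b))
    if longest == 0:
        return False
    return levenshtein(tag_a, tag_b) / longest <= max_ratio


reordered = synonym_candidate(["空调", "维修"], ["维修", "空调"])
close = synonym_candidate(["疏通", "下水道"], ["疏通", "下水管道"])
different = synonym_candidate(["空调", "维修"], ["搬家", "公司"])
```

Pairs flagged this way are only candidates. An edit-distance rule in particular can match tags that look alike but mean different things, so the other signals listed above remain useful as cross-checks.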

Negative‑Word Mining : Negation words, POS templates, and frequency analysis are used to detect negative tags, though manual review is required due to higher error rates.
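The negation-word and frequency checks might look like the sketch below. The cue list is illustrative rather than the team's lexicon, POS templates are left out, and the function name and `min_freq` value are hypothetical.

```python
from collections import Counter

# Illustrative negation cues; a production list would be curated.
NEGATION_CUES = ("不", "无", "非", "没有", "拒绝", "禁止")


def flag_negative_tags(tag_counts, cues=NEGATION_CUES, min_freq=2):
    """Return frequent tags containing a negation cue, most frequent first."""
    flagged = [(tag, n) for tag, n in tag_counts.items()
               if n >= min_freq and any(cue in tag for cue in cues)]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


tag_counts = Counter({"不收中介费": 5, "无需预约": 1, "上门维修": 9, "非常专业": 3})
review_queue = flag_negative_tags(tag_counts)
```

The tag "非常专业" is flagged only because it contains "非", which is the kind of false positive that makes the manual review step necessary.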

Finally, tag weights are updated daily based on user interaction feedback to reflect dynamic relevance.
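The article does not describe the update rule. One plausible scheme, shown purely as an assumption, blends the previous weight with a smoothed click-through rate from the latest day, so that tags with little traffic move slowly. All names and constants below are hypothetical.

```python
def update_weight(old_weight, clicks, impressions,
                  alpha=0.3, prior_ctr=0.05, prior_strength=20):
    """Blend yesterday's weight with today's smoothed click-through rate.

    alpha controls how fast weights react; the prior keeps tags with few
    impressions close to prior_ctr instead of jumping to 0 or 1.
    """
    smoothed_ctr = (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)
    return (1 - alpha) * old_weight + alpha * smoothed_ctr


quiet_day = update_weight(0.10, clicks=0, impressions=0)
busy_day = update_weight(0.10, clicks=30, impressions=80)
```

With no traffic the weight drifts toward the prior (0.10 becomes 0.085), while a day with 30 clicks on 80 impressions lifts it to 0.163.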

Conclusion : The article summarizes the practical tag‑mining workflow, covering model‑based and rule‑based methods, and presents synonym and negative‑word strategies for tag optimization.

References :
1. Evaluation of 11 open‑source Chinese segmentation engines, https://www.cnblogs.com/croso/p/5349517.html
2. Text data mining based on SNS, http://www.matrix67.com/blog/archives/5044


