Tag Extraction for 58 Yellow Pages Posts Using Sequence Labeling and Model Optimization
This article describes a complete solution for extracting and normalizing tags from 58 Yellow Pages service posts, covering candidate word acquisition, sequence‑labeling models such as CRF and BERT‑CRF, hierarchical softmax optimization for massive label spaces, and experimental results on both post content and user reviews.
Background
58 Yellow Pages is a platform where merchants post service advertisements and users search for services. The posts are pure text, lacking structured tags, which hampers retrieval, ranking, and user experience. Extracting concise, high‑quality tags from post titles, descriptions, and user reviews can highlight post characteristics and enable faster service discovery.
Tag Extraction Process
1. Candidate Word Acquisition
Text is tokenized and stop words are removed. Each candidate string is then scored for cohesion (how strongly its parts bind together) and freedom (how varied its left and right neighbors are), and strings above a threshold are recalled as candidate terms. N‑gram enumeration can also be used. This step yields a large pool of potential tags.
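The cohesion and freedom scores can be sketched as follows. This is a minimal illustration, not the article's implementation: it assumes cohesion is the minimum pointwise mutual information over all binary splits of a character n‑gram and freedom is the smaller of the left and right neighbor entropies, which is one common formulation. The function name `candidate_scores` and the `max_len` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def candidate_scores(corpus, max_len=4):
    """Return {ngram: (cohesion, freedom)} for character n-grams of length 2..max_len.

    Assumed definitions (not confirmed by the source):
      cohesion = min PMI over every way of splitting the n-gram in two
      freedom  = min(entropy of left neighbors, entropy of right neighbors)
    """
    counts = Counter()
    left = defaultdict(Counter)   # n-gram -> counts of the character just before it
    right = defaultdict(Counter)  # n-gram -> counts of the character just after it
    total = 0
    for text in corpus:
        n = len(text)
        total += n
        for i in range(n):
            for k in range(1, max_len + 1):
                if i + k > n:
                    break
                gram = text[i:i + k]
                counts[gram] += 1
                if i > 0:
                    left[gram][text[i - 1]] += 1
                if i + k < n:
                    right[gram][text[i + k]] += 1

    def entropy(counter):
        s = sum(counter.values())
        if s == 0:
            return 0.0
        return -sum(c / s * math.log(c / s) for c in counter.values())

    scores = {}
    for gram, c in counts.items():
        if len(gram) < 2:
            continue
        p = c / total
        cohesion = min(
            math.log(p / ((counts[gram[:i]] / total) * (counts[gram[i:]] / total)))
            for i in range(1, len(gram))
        )
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        scores[gram] = (cohesion, freedom)
    return scores


def recall_candidates(scores, min_cohesion, min_freedom):
    """Keep n-grams whose two scores both reach the (hypothetical) thresholds."""
    return [g for g, (c, f) in scores.items() if c >= min_cohesion and f >= min_freedom]
```

On a toy corpus such as `["居民搬家很快", "公司搬家便宜", "个人搬家放心"]`, the string 搬家 has three distinct neighbors on each side and therefore a higher freedom score than 民搬, which always follows the same character.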
2. Tag Extraction Workflow
The pipeline trains a model to extract key phrases, normalizes the extracted phrases, and scores them for ranking. To handle the ~100k‑tag vocabulary efficiently, the model outputs are fed into a hierarchical softmax or multi‑output hash rather than a single flat softmax.
Model Architecture
Traditional unsupervised methods (TF‑IDF, TextRank, LDA) perform poorly for this task. Supervised approaches treat tag extraction as a multi‑label classification problem, but the massive label space makes a plain dense softmax infeasible. Sequence labeling models (CRF, BiLSTM‑CRF, CNN‑CRF, BERT‑CRF, RoBERTa‑CRF, IDCNN‑CRF) are employed to tag tokens directly, allowing discovery of new tags and precise source highlighting.
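To make the sequence‑labeling formulation concrete, the sketch below turns per‑character labels into tag spans with their source offsets, which is what allows highlighting. It assumes a plain B/I/O label scheme; the article does not state which scheme was used, and `decode_bio` is a hypothetical helper name.

```python
def decode_bio(chars, labels):
    """Convert per-character B/I/O labels into (text, start, end) spans.

    Assumes "B" opens a tag, "I" continues it, and anything else is outside.
    An "I" with no open tag is tolerated and treated as the start of a tag.
    """
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                spans.append((start, i))  # close the previous tag
            start = i
        elif label == "I":
            if start is None:
                start = i
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return [("".join(chars[s:e]), s, e) for s, e in spans]


text = "居民搬家、空调移机"
labels = ["B", "I", "I", "I", "O", "B", "I", "I", "I"]
tags = decode_bio(list(text), labels)
# tags == [("居民搬家", 0, 4), ("空调移机", 5, 9)]
```

Because the spans are cut from the input itself, a phrase that is absent from any predefined vocabulary can still be extracted, and the offsets point back to where it occurred.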
Experimental Comparison
Models were evaluated on a 1,000‑sample dataset (900 train, 100 test) using exact and soft metrics. Results (precision/recall/F1) show IDCNN‑CRF achieving the best balance (precision ≈ 80.03%, recall ≈ 82.76%, F1 ≈ 81.37%) with reasonable inference speed.
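The reported F1 is the harmonic mean of the first two figures, which is how precision and recall combine. The sketch below checks that arithmetic and shows one plausible reading of the exact metric, where a predicted tag counts only if it is identical to a gold tag. The article does not define its exact or soft metrics, so `exact_match_prf` is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_match_prf(predicted, gold):
    """Precision, recall and F1 where a tag counts only on an exact string match."""
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall, f1_score(precision, recall)


# Reproduces the IDCNN-CRF figure quoted above.
reported_f1 = round(f1_score(80.03, 82.76), 2)  # 81.37
```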
Hierarchical Softmax Optimization
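To reduce the output layer size, a two‑level hierarchical softmax (or multi‑output hash) maps the ~100k tags onto two 320‑node layers, so that each tag is identified by a pair of indices (320 × 320 = 102,400 combinations). The output layer shrinks from ~100k units to 640 units, and for a 768‑dimensional encoder output its parameter count falls from roughly 77 M to under 0.5 M, while preserving prediction speed.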
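One way to picture the mapping is to split each tag id into a pair of indices, one per 320‑node layer. The sketch below uses `divmod` for the split, which is an assumption: the article does not say how tags are assigned to index pairs, and `encode_tag` and `decode_tag` are hypothetical names.

```python
HEAD_SIZE = 320  # nodes per layer, as stated in the article


def encode_tag(tag_id, head_size=HEAD_SIZE):
    """Split a tag id into (first_index, second_index); assumed divmod mapping."""
    if not 0 <= tag_id < head_size * head_size:
        raise ValueError("tag_id out of range for two heads of this size")
    return divmod(tag_id, head_size)


def decode_tag(first_scores, second_scores):
    """Recover a tag id from the argmax of each layer's scores."""
    head_size = len(second_scores)
    i = max(range(len(first_scores)), key=first_scores.__getitem__)
    j = max(range(len(second_scores)), key=second_scores.__getitem__)
    return i * head_size + j


pair = encode_tag(99_999)  # (312, 159)
```

During training each layer would get its own cross‑entropy target (the first and second index), and at inference the two argmax results are recombined into one tag id.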
Multi‑Head Softmax for Large Label Sets
Two parallel softmax heads of 320 units each can represent 320 × 320 = 102,400 classes with only 492,160 parameters, dramatically simplifying training and inference.
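The 492,160 figure can be reproduced with simple arithmetic if the heads are ordinary linear layers with bias on top of a 768‑dimensional encoder output. The hidden size is an assumption (it is the usual BERT‑base width and is the value that makes the numbers match); the article does not state it.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a dense layer: weights plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


HIDDEN = 768  # assumed encoder output width
HEAD = 320    # units per softmax head, from the article

two_heads = 2 * linear_params(HIDDEN, HEAD)       # 492,160
flat_softmax = linear_params(HIDDEN, HEAD * HEAD)  # 78,745,600
reduction = flat_softmax // two_heads              # 160
```

Under these assumptions a flat softmax over the same 102,400 classes would need 160 times as many output parameters.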
Tag Ranking
Extracted tags are scored using TF‑IDF and similarity between tag embeddings and the post title (or first sentence) encoded by BERT; the scores are combined to order tags by importance.
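A minimal sketch of the ranking step follows. It assumes the two signals are combined as a weighted sum with a hypothetical weight `alpha`; the article only says the scores are combined. The vectors here are toy stand‑ins for BERT embeddings, and duplicate tags are collapsed before scoring.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_tags(tags, tfidf, tag_vecs, title_vec, alpha=0.5):
    """Order tags by alpha * tfidf + (1 - alpha) * similarity to the title.

    tfidf:     {tag: tf-idf score}
    tag_vecs:  {tag: embedding}, e.g. produced by a BERT encoder (not shown)
    title_vec: embedding of the post title or first sentence
    """
    unique = list(dict.fromkeys(tags))  # drop repeats, keep first-seen order
    scored = [
        (tag, alpha * tfidf.get(tag, 0.0) + (1 - alpha) * cosine(tag_vecs[tag], title_vec))
        for tag in unique
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy example: "居民搬家" is closer to the title, "钟点工" has the higher TF-IDF.
ranked = rank_tags(
    ["居民搬家", "钟点工", "居民搬家"],
    tfidf={"居民搬家": 0.2, "钟点工": 0.8},
    tag_vecs={"居民搬家": [1.0, 0.0], "钟点工": [0.0, 1.0]},
    title_vec=[1.0, 0.0],
)
```

With `alpha=0.5` the title‑aligned tag ranks first; raising `alpha` toward 1 lets TF‑IDF dominate.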
Extension to Review Tagging
For short user reviews, a multi‑label classification approach with BERT and binary cross‑entropy loss yields high precision (0.9714) and recall (0.9437). Issues with bias toward positive reviews were mitigated by adding negative samples.
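The multi‑label setup can be illustrated without a deep learning framework: each label gets its own logit, the loss is the mean binary cross‑entropy over labels, and every label whose sigmoid probability clears a threshold is emitted. The label names and the 0.5 threshold below are hypothetical, and the logits stand in for the output of a BERT classifier head.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x * t + log(1 + exp(-|x|)).
    """
    losses = [
        max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
        for x, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose independent probability reaches the threshold."""
    return [name for x, name in zip(logits, label_names) if sigmoid(x) >= threshold]


label_names = ["punctual", "polite", "fair_price"]  # hypothetical review tags
predicted = predict_labels([2.0, -1.0, 0.3], label_names)
# predicted == ["punctual", "fair_price"]
```

Because each label is scored independently, a review can receive zero, one, or several tags, which is what distinguishes this from single‑label softmax classification.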
Conclusion
When the tag set is small, multi‑label classification suffices; for large vocabularies or when source highlighting is needed, sequence labeling combined with normalization and hierarchical output layers provides an effective solution.
Sample Post Content
全市连锁、就近派车、一条龙服务、上门估价、正规发票、绝 不加价、预约有优惠 【专业承接】居民搬家、公司搬家、空调移机,长短途搬家、搬厂、搬仓库、各企事业单位搬迁、起重吊装、精品搬运、拆装家具、箱货车搬家、金杯车搬家、尾板车搬家、长途搬家、个人搬家、小型搬家、人力搬运、钢琴搬运、装卸货柜、搬公司、搬仓库、搬写字楼、公司搬迁、仓库搬迁 、设备搬迁移位、拆装空调、搬厂、搬写字楼、拆装空调、面包车、物品包装、装卸、高难度家具拆卸、大型机器搬运、高空吊装、二次收费、随约随到、准时到达、随叫随到、24小时服务、 中途不加价、折旧赔偿、居民搬家、个人搬家、家庭搬家、别墅搬迁、人工搬运、面包车搬家、金杯车搬家、箱货车搬家、贵重物品搬运、长途搬家、异地搬家、长途搬运、跨市搬家、搬厂搬货、 大型设备搬迁、移位、起重吊装、仓库搬迁、搬机器、装卸货柜、公司搬家、搬办公室、搬公司、搬仓库、搬写字楼、企事业单位、学校搬迁、公司搬迁、写字楼搬迁、钟点工、商场搬迁、专业包装、提供纸箱、拆装家具、拆装家私、欧式家具拆装、衣柜床拆装、办公桌拆装、钢琴搬运、搬运钢琴、高空吊装、钢琴、 大理石、玻璃、建筑材料、空调拆装、空调拆装、天花机拆装、维修、移机、加雪种、清洗、回收
Extraction Result
全市连锁|正规发票|不加价|居民搬家|公司搬家|空调移机|长短途搬家|搬厂|搬仓库|企事业单位搬迁|起重吊装|精品搬运|拆装家具|金杯车搬家|尾板车搬家|长途搬家|个人搬家|小型搬家|钢琴搬运|装卸货柜|搬公司|搬仓库|搬写字楼、公司搬迁|仓库搬迁|设备搬迁移位|拆装空调|搬厂|搬写字楼|拆装空调|面包车|物品包装|装卸|高难度家具拆卸|大型机器搬运|高空吊装|二次收费|随约随到|准时到达|随叫随到|24小时服务|中途不加价|折旧赔偿|居民搬家|个人搬家|家庭搬家|别墅搬迁|人工搬运|面包车搬家|金杯车搬家|箱货车搬家|贵重物品搬运|长途搬家|异地搬家|长途搬运|跨市搬家|搬厂搬货|大型设备搬迁、移位|起重吊装|仓库搬迁|搬机器|装卸货柜|公司搬家|搬办公室|搬公司|搬仓库|搬写字楼、企事业单位、学校搬迁|公司搬迁|写字楼搬迁|钟点工|商场搬迁|专业包装|提供纸箱|拆装家具|拆装家私|欧式家具拆装|衣柜床拆装|办公桌拆装|钢琴搬运|搬运钢琴|高空吊装|钢琴|大理石、玻璃、建筑材料|空调拆装|空调拆装|天花机拆装、维修、移机、加雪种、清洗、回收
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.