Design, Evaluation, and Production of a VOC Tagging System for Taobao User Experience
Taobao’s Technical Industry Data team designed a four‑level VOC tagging hierarchy to unify fragmented user‑feedback sources, evaluated label similarity with vector‑based distance matrices, optimized tag groups via entropy‑driven re‑grouping, built a stacking ensemble of FastText and TextCNN achieving over 90% accuracy, and deployed an automated production pipeline that generates tags, maintains ODPS tables, and provides APIs for rapid experimentation.
This article is the sixth part of a ten‑article series that shares Taobao's user‑experience data‑science practices, covering product detail pages, logistics, performance, and customer service.
What is VOC? VOC (Voice of Customer) data consists of unstructured user feedback such as inquiries, complaints, and reviews. VOC tagging extracts semantic information from massive VOC texts using NLP, creating a structured label system that helps identify experience problems and drive business improvements.
Challenges of Taobao VOC tags include fragmented data sources (customer service, reviews, chat bots, etc.), diverse text formats (single messages, long paragraphs, conversational logs), and heterogeneous label definitions across business units, leading to low reusability.
Tag hierarchy design adopts a four‑level structure: the first three levels are generic (e.g., "Product Inquiry → Attribute Inquiry → Brand"), while the fourth level is industry‑specific and stored as key‑value pairs. This design enables both uniformity and customization.
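As an illustration of the four‑level structure, the sketch below models a tag as three generic levels plus a key‑value fourth level. The class name `VocTag`, its field names, and the fourth‑level keys and values are hypothetical; the article does not describe the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VocTag:
    """One VOC tag: three generic levels plus an industry-specific fourth level."""
    level1: str
    level2: str
    level3: str
    # Fourth level stored as key-value pairs (keys and values here are made up).
    industry_attrs: Dict[str, str] = field(default_factory=dict)

    def generic_path(self) -> str:
        # The generic part is shared across all industries.
        return " > ".join((self.level1, self.level2, self.level3))


tag = VocTag(
    "Product Inquiry", "Attribute Inquiry", "Brand",
    industry_attrs={"industry": "apparel", "attribute": "fabric"},
)
print(tag.generic_path())
print(tag.industry_attrs)
```

Keeping the generic path separate from the key‑value level is what lets two industries share the same first three levels while still attaching their own attributes.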
Tag‑structure evaluation builds a distance matrix by vectorizing VOC texts (TF‑IDF, word2vec, or BERT) and computing Euclidean distances between label clusters. Label pairs whose distance falls below a statistical threshold (e.g., the lower quartile of the pairwise distances) are treated as highly similar and merged. The evaluation objective is to minimize the variance of the distance distribution, which keeps the tag set balanced and discriminative.
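The evaluation step can be sketched with the standard library alone. The code below assumes each label is represented by the centroid of its text vectors, which is one plausible reading of "label clusters"; the toy two‑dimensional vectors and label names are invented, and in practice the vectors would come from TF‑IDF, word2vec, or BERT.

```python
import math
from itertools import combinations
from statistics import pvariance, quantiles


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def label_distances(vectors_by_label):
    """Euclidean distance between the centroids of every pair of labels."""
    centroids = {label: centroid(vs) for label, vs in vectors_by_label.items()}
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


def merge_candidates(distances):
    """Pairs closer than the lower quartile of all pairwise distances."""
    q1 = quantiles(list(distances.values()), n=4)[0]
    return [pair for pair, d in distances.items() if d < q1]


# Toy vectors (assumed data, two dimensions for readability).
vectors_by_label = {
    "brand": [[0.0, 0.0], [0.0, 2.0]],
    "material": [[0.0, 1.0], [2.0, 1.0]],
    "delivery delay": [[10.0, 10.0], [10.0, 12.0]],
    "refund": [[20.0, 0.0], [20.0, 2.0]],
}
dists = label_distances(vectors_by_label)
candidates = merge_candidates(dists)
# Variance of the distance distribution, the quantity the evaluation minimizes.
spread = pvariance(list(dists.values()))
print(candidates, round(spread, 2))
```

Here "brand" and "material" sit much closer together than any other pair, so they are the only merge candidates.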
Optimization workflow enumerates the candidate binary re‑groupings, computes the entropy‑based information gain of each, selects the re‑grouping with the highest gain, removes the merged pair from the candidate set, and repeats until no further improvement is possible. Automating this loop greatly reduces the manual effort of restructuring tags.
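The article does not define the information gain precisely, so the following is only a sketch of a single selection step under one assumption: a re‑grouping is a split of a label group into two sub‑groups, scored with the usual decision‑tree formula (parent entropy minus the size‑weighted entropy of the two children). The label names and sample counts are made up.

```python
import math
from itertools import combinations


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(counts_by_label, left_labels):
    """Parent entropy minus the size-weighted entropy of the two sub-groups."""
    left = [c for lab, c in counts_by_label.items() if lab in left_labels]
    right = [c for lab, c in counts_by_label.items() if lab not in left_labels]
    total = sum(left) + sum(right)
    children = (sum(left) * entropy(left) + sum(right) * entropy(right)) / total
    return entropy(list(counts_by_label.values())) - children


def best_binary_regrouping(counts_by_label):
    """Enumerate every two-way split of the labels and keep the best one."""
    labels = sorted(counts_by_label)
    first, rest = labels[0], labels[1:]
    best_gain, best_left = -1.0, None
    # Pin the first label to the left side so mirrored splits are not repeated,
    # and stop before len(rest) so the right side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = frozenset((first, *extra))
            gain = information_gain(counts_by_label, left)
            if gain > best_gain:
                best_gain, best_left = gain, left
    return best_gain, best_left


# Assumed sample counts per label.
counts = {"brand": 40, "material": 10, "size": 30, "color": 20}
gain, left_group = best_binary_regrouping(counts)
print(round(gain, 3), sorted(left_group))
```

The full workflow would call this step repeatedly, dropping the chosen pair each time and stopping once the gain no longer improves; that outer loop is omitted here.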
VOC sample construction uses high‑confidence (>0.99) VOC records, cleans noise, and performs stratified sampling per label. Imbalanced groups are addressed via down‑sampling or over‑sampling to obtain balanced training sets.
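A minimal sketch of this sampling step, assuming records arrive as `(text, label, confidence)` tuples. The function name `balanced_sample`, the `per_label` target size, and the example records are hypothetical; noise cleaning is reduced to dropping blank texts.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_label, min_confidence=0.99, seed=0):
    """Keep high-confidence records, then draw per_label texts for each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label, confidence in records:
        # Filter on confidence and drop empty texts (stand-in for noise cleaning).
        if confidence > min_confidence and text.strip():
            by_label[label].append(text.strip())

    sample = []
    for label, texts in sorted(by_label.items()):
        if len(texts) >= per_label:
            chosen = rng.sample(texts, per_label)  # down-sample large groups
        else:
            # Over-sample small groups by drawing extra copies with replacement.
            chosen = texts + rng.choices(texts, k=per_label - len(texts))
        sample.extend((text, label) for text in chosen)
    return sample


# Invented example records.
records = [
    ("Where is my parcel?", "Logistics", 0.999),
    ("Parcel is late", "Logistics", 0.995),
    ("Courier never called", "Logistics", 0.998),
    ("Tracking not updated", "Logistics", 0.997),
    ("Is this genuine leather?", "Product Inquiry", 0.996),
    ("What brand is it?", "Product Inquiry", 0.80),  # below the threshold
    ("   ", "Logistics", 0.999),                     # blank after cleaning
]
sample = balanced_sample(records, per_label=2)
print(sample)
```

Every label ends up with exactly `per_label` examples, whether it started with more or fewer.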
Model training employs a stacking ensemble: FastText and TextCNN serve as base learners, and their outputs are combined through boosting and linear regression into a stronger classifier. A multi‑class model is trained for each label group, achieving >90% accuracy with a production cycle of about one week.
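To make the combining step concrete, here is a deliberately simplified sketch rather than the team's actual ensemble: it blends the class probabilities of two base models with a single weight fitted by least squares on held‑out predictions, which is a one‑parameter stand‑in for the linear‑regression meta‑learner. Boosting is not covered. The probability values are invented stand‑ins for FastText and TextCNN outputs.

```python
def fit_blend_weight(probs_a, probs_b, labels):
    """Least-squares weight w for the blend w * a + (1 - w) * b against one-hot targets."""
    num = den = 0.0
    for pa, pb, y in zip(probs_a, probs_b, labels):
        for k, (a, b) in enumerate(zip(pa, pb)):
            target = 1.0 if k == y else 0.0
            num += (a - b) * (target - b)
            den += (a - b) ** 2
    if den == 0.0:
        return 0.5  # the two models agree everywhere; any weight is equivalent
    # Clamp to [0, 1] so the blend stays a convex combination.
    return min(1.0, max(0.0, num / den))


def predict(pa, pb, w):
    """Index of the class with the highest blended probability."""
    blended = [w * a + (1 - w) * b for a, b in zip(pa, pb)]
    return max(range(len(blended)), key=blended.__getitem__)


# Held-out predictions from two base models (assumed values, two classes).
probs_a = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]]  # e.g. a FastText-like model
probs_b = [[0.4, 0.6], [0.6, 0.4], [0.5, 0.5]]  # e.g. a TextCNN-like model
labels = [0, 1, 0]

w = fit_blend_weight(probs_a, probs_b, labels)
predictions = [predict(pa, pb, w) for pa, pb in zip(probs_a, probs_b)]
print(w, predictions)
```

Fitting the weight on held‑out predictions, not on the data the base models were trained on, is the part that makes this stacking and not plain averaging.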
Production pipeline includes automatic tag generation, ODPS dimension‑table maintenance, and service APIs. Large‑scale batch jobs deliver tags as ODPS tables, while lightweight requests receive code, pretrained models, or UDFs for rapid experimentation.
Team introduction – the Taobao Technical Industry Data team focuses on data engineering, mining, and governance for e‑commerce scenarios and is actively hiring.
DaTaobao Tech
Official account of DaTaobao Technology