Design, Evaluation, and Production of a VOC Tagging System for Taobao User Experience

Taobao's Technical Industry Data team designed a four-level VOC tagging hierarchy to unify fragmented user-feedback sources. The team evaluated label similarity with vector-based distance matrices, optimized tag groups through entropy-driven re-grouping, and built a stacking ensemble of FastText and TextCNN that achieves over 90% accuracy. The system runs as an automated production pipeline that generates tags, maintains ODPS tables, and provides APIs for rapid experimentation.

DaTaobao Tech

This article is the sixth part of a ten‑article series that shares Taobao's user‑experience data‑science practices, covering product detail pages, logistics, performance, and customer service.

What is VOC? VOC (Voice of Customer) data consists of unstructured user feedback such as inquiries, complaints, and reviews. VOC tagging extracts semantic information from massive VOC texts using NLP, creating a structured label system that helps identify experience problems and drive business improvements.

Challenges of Taobao VOC tags include fragmented data sources (customer service, reviews, chat bots, etc.), diverse text formats (single messages, long paragraphs, conversational logs), and heterogeneous label definitions across business units, leading to low reusability.

Tag hierarchy design adopts a four‑level structure: the first three levels are generic (e.g., "Product Inquiry → Attribute Inquiry → Brand"), while the fourth level is industry‑specific and stored as key‑value pairs. This design enables both uniformity and customization.
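
As a rough illustration of this structure, the sketch below models one tag as three generic levels plus a set of key-value pairs for the fourth level. The class name `VocTag`, the field names, and the example fourth-level keys and values are all hypothetical; the article does not describe the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class VocTag:
    """One VOC tag: three generic levels plus industry-specific key-value pairs."""
    level1: str
    level2: str
    level3: str
    # Fourth level: industry-specific attributes stored as key-value pairs
    # (kept as a tuple of pairs so the tag stays hashable).
    level4: Tuple[Tuple[str, str], ...] = ()

    def generic_path(self) -> str:
        """Path made of the three levels shared by every industry."""
        return " > ".join((self.level1, self.level2, self.level3))

    def industry_attrs(self) -> Dict[str, str]:
        """Fourth-level attributes as a plain dictionary."""
        return dict(self.level4)


# Hypothetical example: the generic path comes from the article,
# the fourth-level pairs are invented for illustration.
tag = VocTag(
    "Product Inquiry", "Attribute Inquiry", "Brand",
    level4=(("industry", "apparel"), ("attribute", "brand_authenticity")),
)
print(tag.generic_path())
print(tag.industry_attrs())
```

Keeping the generic levels as fixed fields and the fourth level as free-form pairs mirrors the stated goal: every business unit shares the same first three levels, while each industry can attach its own attributes without changing the shared schema.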

Tag-structure evaluation builds a distance matrix: VOC texts are vectorized (with TF-IDF, word2vec, or BERT), the texts under each label form a cluster, and Euclidean distances are computed between the label clusters. Label pairs whose distance falls below a statistical threshold (e.g., the lower quartile of all pairwise distances) are considered highly similar and are merged. The evaluation objective is to minimize the variance of the distance distribution, so that the resulting tag set is balanced and discriminative.
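
The evaluation step can be sketched in a few lines of standard-library Python. This is only one possible reading of the description: it assumes each label cluster is summarized by the centroid of its text vectors, and that the merge threshold is the lower quartile of all pairwise centroid distances. The function names and the toy 2-D vectors are hypothetical; in practice the vectors would come from TF-IDF, word2vec, or BERT.

```python
import math
from itertools import combinations
from statistics import pvariance, quantiles


def centroid(vectors):
    """Mean vector of a label's text vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def label_distance_matrix(label_vectors):
    """Euclidean distance between the centroids of every pair of labels."""
    centroids = {label: centroid(vecs) for label, vecs in label_vectors.items()}
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


def merge_candidates(distances):
    """Label pairs closer than the lower quartile of all pairwise distances."""
    q1 = quantiles(list(distances.values()), n=4)[0]
    return sorted(pair for pair, d in distances.items() if d < q1)


# Toy vectors, invented for illustration only.
label_vectors = {
    "brand": [[-1.0, 0.0], [1.0, 0.0]],
    "brand_model": [[1.0, -1.0], [1.0, 1.0]],
    "refund": [[10.0, 0.0]],
    "shipping": [[0.0, 10.0]],
}
distances = label_distance_matrix(label_vectors)
candidates = merge_candidates(distances)
# Variance of the distance distribution: the quantity the evaluation tries to minimize.
spread = pvariance(list(distances.values()))
print(candidates, round(spread, 2))
```

In this toy data the two brand-related labels sit much closer together than any other pair, so they are the only merge candidates.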

Optimization workflow: the procedure enumerates the possible binary re-groupings, calculates the information gain of each one using entropy, selects the re-grouping with the highest gain, removes the merged pair from the candidate set, and repeats until no further improvement is possible. Automating this loop substantially reduces manual effort.
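
The article does not spell out the exact criterion, so the following is a minimal sketch of a single iteration under one assumption: a "binary re-grouping" is a split of a label group into two subgroups, and its information gain is the entropy of the group's label distribution minus the size-weighted entropy of the two subgroups. The function names and the label counts are hypothetical.

```python
import math
from itertools import combinations


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def best_binary_regrouping(label_counts):
    """Enumerate every two-way split of the labels and return the one with the
    highest information gain as (left_labels, right_labels, gain)."""
    labels = sorted(label_counts)
    total = sum(label_counts.values())
    parent = entropy([label_counts[l] for l in labels])
    first, rest = labels[0], labels[1:]
    best = None
    # Pin the first label to the left side so mirrored splits are not counted twice;
    # r stops before len(rest) so the right side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(l for l in labels if l not in left)
            children = 0.0
            for side in (left, right):
                side_counts = [label_counts[l] for l in side]
                children += sum(side_counts) / total * entropy(side_counts)
            gain = parent - children
            if best is None or gain > best[2]:
                best = (left, right, gain)
    return best


# Hypothetical record counts per label within one group.
left, right, gain = best_binary_regrouping({"a": 50, "b": 30, "c": 15, "d": 5})
print(left, right, round(gain, 3))
```

Exhaustive enumeration grows exponentially with the number of labels in a group, so a sketch like this is only practical for small groups; the full workflow would repeat the step until the best gain no longer improves.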

VOC sample construction uses high‑confidence (>0.99) VOC records, cleans noise, and performs stratified sampling per label. Imbalanced groups are addressed via down‑sampling or over‑sampling to obtain balanced training sets.
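
A minimal sketch of this sampling step follows, assuming each record is a `(text, label, confidence)` triple and that "cleaning" simply means trimming whitespace and dropping empty texts. The record layout, the function name `build_balanced_sample`, and the `per_label` target are assumptions made for illustration; only the 0.99 confidence threshold comes from the article.

```python
import random
from collections import Counter, defaultdict


def build_balanced_sample(records, per_label, min_conf=0.99, seed=0):
    """Keep high-confidence records, then draw exactly `per_label` texts per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label, conf in records:
        text = text.strip()                      # minimal noise cleaning
        if conf > min_conf and text:
            by_label[label].append(text)

    sample = []
    for label, texts in sorted(by_label.items()):
        if len(texts) >= per_label:
            chosen = rng.sample(texts, per_label)            # down-sample without replacement
        else:
            extra = rng.choices(texts, k=per_label - len(texts))
            chosen = texts + extra                           # over-sample with replacement
        sample.extend((t, label) for t in chosen)
    return sample


# Invented records for illustration.
records = [
    ("where is my parcel", "logistics", 0.995),
    ("parcel is late", "logistics", 0.999),
    ("no tracking update", "logistics", 0.998),
    ("courier was rude", "logistics", 0.997),
    ("is this the genuine brand", "brand", 0.996),
    ("which brand is it", "brand", 0.90),        # dropped: confidence too low
    ("   ", "brand", 0.999),                     # dropped: empty after cleaning
]
sample = build_balanced_sample(records, per_label=3)
print(Counter(label for _, label in sample))
```

Over-sampling by duplication is the simplest option; it balances the class counts but adds no new information, so it is worth checking that the duplicated texts do not leak between training and evaluation splits.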

Model training employs a stacking ensemble: FastText and TextCNN serve as base learners, and their outputs are combined through boosting and linear regression into a stronger classifier. One multi-class model is trained per label group. The models achieve over 90% accuracy, with a production cycle of about one week.
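
To make the stacking idea concrete, the sketch below fits only the linear-regression meta-learner, and only for two base learners. It assumes the base models' class probabilities on held-out data are already available; the probability tables here are invented stand-ins, not real FastText or TextCNN outputs, and the boosting stage mentioned above is omitted. The function names are hypothetical.

```python
def fit_stacking_weights(probs_a, probs_b, labels, n_classes):
    """Least-squares weights (w_a, w_b) so that w_a * probs_a + w_b * probs_b
    best matches the one-hot labels on held-out data (2x2 normal equations)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for pa, pb, y in zip(probs_a, probs_b, labels):
        for k in range(n_classes):
            target = 1.0 if k == y else 0.0
            a11 += pa[k] * pa[k]
            a12 += pa[k] * pb[k]
            a22 += pb[k] * pb[k]
            b1 += pa[k] * target
            b2 += pb[k] * target
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("base learner outputs are collinear; weights are not unique")
    # Cramer's rule for the symmetric 2x2 system.
    w_a = (b1 * a22 - b2 * a12) / det
    w_b = (a11 * b2 - a12 * b1) / det
    return w_a, w_b


def stacked_predict(pa, pb, weights):
    """Class index with the highest weighted score."""
    w_a, w_b = weights
    scores = [w_a * x + w_b * y for x, y in zip(pa, pb)]
    return max(range(len(scores)), key=scores.__getitem__)


# Invented held-out predictions: learner A is fairly accurate, learner B is noisy.
labels = [0, 0, 1, 1]
probs_a = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
probs_b = [[0.6, 0.4], [0.4, 0.6], [0.5, 0.5], [0.3, 0.7]]

weights = fit_stacking_weights(probs_a, probs_b, labels, n_classes=2)
prediction = stacked_predict([0.3, 0.7], [0.6, 0.4], weights)
print(weights, prediction)
```

On this toy data the meta-learner gives the more accurate learner the larger weight, so when the two learners disagree the stacked prediction follows it. The weights should be fitted on predictions the base learners did not train on, otherwise the meta-learner rewards overfitting.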

Production pipeline includes automatic tag generation, ODPS dimension‑table maintenance, and service APIs. Large‑scale batch jobs deliver tags as ODPS tables, while lightweight requests receive code, pretrained models, or UDFs for rapid experimentation.

Team introduction – the Taobao Technical Industry Data team focuses on data engineering, mining, and governance for e‑commerce scenarios and is actively hiring.

Tags: e-commerce, machine learning, data science, NLP, tagging, VOC