Query Understanding in JD Daojia E‑commerce Search: Architecture, Core Algorithms, and Experimental Results
This article presents a comprehensive overview of JD Daojia's query understanding system for e‑commerce search, detailing its overall architecture; core modules such as segmentation, term weighting, query rewriting, and intent recognition; the algorithms they employ; experimental results; and future directions.
1. Introduction
Search is the primary traffic entry point for the JD Daojia app, spanning the homepage, in‑store, channel, and mini‑program search scenarios. Accurately understanding user queries and surfacing the most relevant results at the top of the ranking are critical to the search experience. Query understanding in e‑commerce involves lexical, syntactic, and semantic parsing that transforms a raw query into structured representations consumed by both the retrieval and ranking modules.
2. Overall Architecture
The search pipeline proceeds from query understanding to retrieval and then ranking. Query understanding provides features for both recall and ranking, influencing overall system intelligence. Typical modules include preprocessing, correction, expansion, normalization, suggestion, segmentation, intent recognition, term importance analysis, and sensitive query detection. JD Daojia’s O2O scenario adds category inclination and store demand identification. The flow diagram is shown below.
Example query "康师傅红烧方便面" is processed through segmentation, preprocessing, term weighting, rewriting, entity recognition, and intent identification, yielding structured entities such as Brand, Attribute, and Entity.
3. Core Algorithms of Query Understanding
3.1 Segmentation
3.1.1 Segmentation Techniques
Segmentation splits a query into terms (e.g., "康师傅|红烧|方便面"). Accuracy directly affects downstream modules like term importance and intent detection. JD Daojia uses a DAG‑based statistical segmentation model with steps: dictionary loading, DAG construction, dynamic programming for maximum‑probability path, and back‑tracking.
The dictionary is stored in a prefix trie for efficient lookup of all candidate terms sharing a prefix.
All possible segmentations are represented as a DAG; dynamic programming finds the highest‑probability path based on term frequencies.
Probabilities are log‑transformed to avoid underflow.
Balancing coarse and fine granularity improves both precision and recall.
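The dictionary → DAG → dynamic-programming pipeline described above can be sketched in a few lines. This is a minimal illustration, not the production segmenter: the dictionary and frequency counts below are toy values, and the trie is replaced by a plain dict lookup.

```python
import math

# Toy dictionary: term -> corpus frequency (illustrative numbers only).
FREQ = {"康师傅": 1200, "红烧": 800, "方便面": 3000, "红": 50,
        "烧": 40, "方便": 900, "面": 700}
TOTAL = sum(FREQ.values())

def build_dag(query):
    """For each start index, list every end index where query[start:end] is a dictionary term."""
    dag = {}
    for i in range(len(query)):
        ends = [j for j in range(i + 1, len(query) + 1) if query[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def segment(query):
    """Right-to-left DP for the maximum-probability path, then a forward walk (back-tracking)."""
    dag, n = build_dag(query), len(query)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        # log-transform the probabilities so long paths do not underflow
        best[i] = max(
            (math.log(FREQ.get(query[i:j], 1) / TOTAL) + best[j][0], j)
            for j in dag[i]
        )
    i, terms = 0, []
    while i < n:
        terms.append(query[i:best[i][1]])
        i = best[i][1]
    return terms

print(segment("康师傅红烧方便面"))  # → ['康师傅', '红烧', '方便面']
```

Note how "方便面" wins over the finer split "方便|面": the DP compares whole-path log-probabilities, which is how granularity is balanced in practice.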
3.1.2 New‑Word Discovery
Unregistered words are discovered using statistical measures such as pointwise mutual information (cohesion) and left/right neighbor entropy (freedom). Words with high cohesion and entropy are added to the dictionary after manual verification.
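A minimal sketch of the two statistics, assuming a raw character corpus and fixed two-character candidates (the production system scores variable-length candidates and adds manual review):

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy of a neighbor-character distribution (freedom of use)."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())

def score_candidates(corpus, min_count=2):
    """Score two-character candidates by PMI (cohesion) and min left/right neighbor entropy (freedom)."""
    chars = Counter(corpus)
    grams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(corpus) - 1):
        g = corpus[i:i + 2]
        if i > 0:
            left[g][corpus[i - 1]] += 1
        if i + 2 < len(corpus):
            right[g][corpus[i + 2]] += 1
    n = len(corpus)
    results = {}
    for g, c in grams.items():
        if c < min_count:
            continue
        # cohesion: pointwise mutual information between the two halves
        pmi = math.log((c / n) / ((chars[g[0]] / n) * (chars[g[1]] / n)))
        # freedom: the candidate must vary on BOTH sides to be a standalone word
        freedom = min(entropy(left[g]), entropy(right[g]))
        results[g] = (pmi, freedom)
    return results

stats = score_candidates("方便面好吃方便面便宜方便面")
```

In this toy corpus "方便" has high cohesion but zero right entropy (it is always followed by "面"), which is exactly the signal that the true unregistered word is the longer "方便面".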
3.2 Term Weighting
Term importance influences retrieval and ranking. Methods include TF‑IDF, static importance (IMP) based on click data, user‑click‑based weighting, and supervised feature‑learning models (LR, XGBoost, LSTM). JD Daojia adopts a pairwise ranking model with feature vectors (offline IQF/IDF/clicks and online POS/semantic embeddings) trained via cross‑entropy loss and gradient descent (preferring batch GD for stability).
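The pairwise scheme can be illustrated with a linear scorer trained by batch gradient descent on a RankNet-style cross-entropy loss. The feature values below are hypothetical stand-ins for the offline/online features named above; the production model is richer, but the training loop has the same shape:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pairwise(pairs, dim, lr=0.1, epochs=200):
    """Pairwise training: for each (x_hi, x_lo) pair the model should score
    x_hi above x_lo; the loss is cross-entropy on sigmoid(s_hi - s_lo)."""
    w = [0.0] * dim
    for _ in range(epochs):
        # batch gradient descent: accumulate over all pairs before each update
        grad = [0.0] * dim
        for x_hi, x_lo in pairs:
            diff = sum(wi * (a - b) for wi, a, b in zip(w, x_hi, x_lo))
            p = sigmoid(diff)  # P(x_hi ranked above x_lo)
            for k in range(dim):
                grad[k] += (p - 1.0) * (x_hi[k] - x_lo[k])  # d(-log p)/dw_k
        for k in range(dim):
            w[k] -= lr * grad[k] / len(pairs)
    return w

# Hypothetical (important-term, filler-term) feature pairs: [idf, click_rate]
pairs = [([2.0, 0.8], [0.5, 0.1]), ([1.8, 0.7], [0.4, 0.2])]
w = train_pairwise(pairs, dim=2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
```

Batch accumulation before each weight update is what the "preferring batch GD for stability" remark refers to: the gradient is averaged over all pairs instead of jittering per example.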
3.3 Query Rewriting
Rewriting expands a query into semantically equivalent variants to improve recall. Approaches covered are edit‑distance/pinyin similarity, collaborative filtering on click co‑occurrence, knowledge‑graph synonym substitution, machine‑translation with reinforcement learning, and the proprietary Query2Vec session‑based method.
3.3.1 Our Approach
We combine collaborative filtering (QueryCF), SimRank/SimRank++ graph‑based similarity, Query2Vec session embeddings, and synonym‑based token replacement. Results are weighted by confidence and merged for parallel recall.
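Of these signals, the click-graph one is the simplest to sketch: a minimal QueryCF variant that represents each query as a vector over clicked items and ranks rewrite candidates by cosine similarity. The click log below is a toy example; the production system also weights candidates by confidence before merging.

```python
import math
from collections import defaultdict

def querycf_similarity(click_log):
    """Collaborative-filtering rewrites: queries whose clicks land on the same
    items are near-synonymous (cosine similarity over item-count vectors)."""
    vectors = defaultdict(lambda: defaultdict(int))
    for query, item in click_log:
        vectors[query][item] += 1

    def cosine(a, b):
        dot = sum(a[i] * b[i] for i in a if i in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    queries = list(vectors)
    return {
        q: sorted(((cosine(vectors[q], vectors[p]), p)
                   for p in queries if p != q), reverse=True)
        for q in queries
    }

log = [("方便面", "sku1"), ("方便面", "sku2"),
       ("泡面", "sku1"), ("泡面", "sku2"),
       ("可乐", "sku9")]
sims = querycf_similarity(log)
# "泡面" (the colloquial synonym) ranks first as a rewrite for "方便面"
```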
3.3.2 Experimental Results
Rewriting yields relative improvements of 0.33% in click‑through rate, 0.4% in conversion rate, and 2.28% in ARPU.
Metric               Relative Lift
Click‑through Rate   0.33%
Conversion Rate      0.4%
ARPU                 2.28%
3.4 Intent Recognition
Intent detection faces challenges such as noisy input, ambiguity, cold‑start, and lack of direct quantitative metrics. The pipeline extracts components (brand, product, attribute, theme) and predicts category inclination using hierarchical multi‑label classification or semantic models fused with click/transaction features.
3.4.1 Component Extraction
Entity recognition uses a Bi‑LSTM+CRF model trained on annotated query data with tags B‑brand, I‑brand, B‑attr, etc., achieving ~93% accuracy.
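At inference time the CRF layer is decoded with the Viterbi algorithm. The sketch below uses hypothetical emission and transition scores; in the real model the emissions come from the Bi‑LSTM and the transitions are learned CRF parameters.

```python
def viterbi(emissions, transitions, tags):
    """CRF decoding: pick the highest-scoring tag sequence given per-token
    emission scores and tag-to-tag transition scores."""
    # paths[t] = (score of best path ending in tag t, that path)
    paths = {t: (emissions[0][t], [t]) for t in tags}
    for emit in emissions[1:]:
        paths = {
            t: max((paths[p][0] + transitions.get((p, t), -1e9) + emit[t],
                    paths[p][1] + [t]) for p in tags)
            for t in tags
        }
    return max(paths.values())[1]

# Toy scores for a three-token query; B-/I-brand mark a brand span (BIO scheme).
tags = ["B-brand", "I-brand", "O"]
emissions = [{"B-brand": 2.0, "I-brand": -1.0, "O": 0.0},
             {"B-brand": -1.0, "I-brand": 2.0, "O": 0.0},
             {"B-brand": -1.0, "I-brand": -1.0, "O": 2.0}]
transitions = {("B-brand", "I-brand"): 1.0, ("B-brand", "O"): 0.0,
               ("I-brand", "I-brand"): 0.5, ("I-brand", "O"): 0.5,
               ("O", "B-brand"): 0.5, ("O", "O"): 0.5}
print(viterbi(emissions, transitions, tags))  # → ['B-brand', 'I-brand', 'O']
```

Missing transitions score −1e9, which is how the CRF forbids invalid tag sequences such as O → I‑brand.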
3.4.2 Category Prediction
Two strategies are employed: hierarchical multi‑label classification using CNNs (fastText/BERT embeddings) with fine‑tuning across label levels, and a fusion model combining click‑based statistical scores with a BERT+GBDT semantic model. Feature set includes semantic similarity scores, price ranges, recall statistics, and token counts.
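One way to sketch the fusion step, assuming per-category scores from both sources; the blend weight `alpha` and the category names are hypothetical, and cold‑start queries with no click history fall back to the semantic model alone:

```python
def fuse_category_scores(click_scores, semantic_scores, alpha=0.6):
    """Blend click-based statistical scores with semantic-model scores per category.
    A query with no click history (cold start) uses the semantic scores as-is."""
    if not click_scores:
        return dict(semantic_scores)
    categories = set(click_scores) | set(semantic_scores)
    return {c: alpha * click_scores.get(c, 0.0)
               + (1 - alpha) * semantic_scores.get(c, 0.0)
            for c in categories}

fused = fuse_category_scores({"方便食品": 0.9}, {"方便食品": 0.7, "饮料": 0.2})
top_category = max(fused, key=fused.get)
```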
3.4.3 Evaluation
Precision, recall, and F1 are computed in a multi‑label setting by averaging per‑sample contributions. The intent system reaches 93% overall accuracy, with downstream business metrics showing an 8.42% increase in search ARPU.
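The example-based averaging described above, in one common variant that computes F1 from the averaged precision and recall:

```python
def multilabel_prf(samples):
    """Example-based multi-label evaluation. Per sample:
    P = |pred ∩ gold| / |pred|, R = |pred ∩ gold| / |gold|;
    then average P and R over samples and derive F1 from the averages."""
    p_sum = r_sum = 0.0
    for gold, pred in samples:
        hits = len(set(gold) & set(pred))
        p_sum += hits / len(pred) if pred else 0.0
        r_sum += hits / len(gold) if gold else 0.0
    p, r = p_sum / len(samples), r_sum / len(samples)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Each sample: (gold category labels, predicted category labels)
p, r, f1 = multilabel_prf([(["a", "b"], ["a"]), (["c"], ["c", "d"])])
```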
4. Summary and Outlook
The paper details JD Daojia’s query understanding pipeline, covering segmentation, term weighting, rewriting, and intent detection, along with practical implementations and experimental gains. Future work includes incorporating user personalization, richer contextual signals, and deep reinforcement learning to better handle long‑tail queries.
5. References
SimRank: A Measure of Structural‑Context Similarity
SimRank++: Query Rewriting through Link Analysis of the Click Graph
Context‑ and Content‑aware Embeddings for Query Rewriting in Sponsored Search
HFT‑CNN: Learning Hierarchical Category Structure for Multi‑label Short Text Categorization
A DNN+GBDT Fusion Model for Query Category Prediction (基于DNN+GBDT的Query类目预测融合模型)
Dada Group Technology