Multi-Category Mixture-of-Experts Model for JD Search Ranking
This article presents a multi‑category Mixture‑of‑Experts (MoE) approach to e‑commerce search ranking. To address category‑specific user behavior and the difficulty of learning small categories, it introduces a hierarchical soft constraint and adversarial regularization, and it reports significant AUC and NDCG gains on the public Amazon dataset and JD's in‑house dataset.
Product search engines are central to e‑commerce platforms: given a user query, they return a personalized ranked list of products. This article describes how a Mixture‑of‑Experts (MoE) model is applied to JD.com's search ranking and the practical improvements made for real‑world deployment.
Background. User behavior varies across top‑level categories (e.g., food vs. apparel), and small sub‑categories suffer from data scarcity, causing their signals to be dominated by large categories during training.
MoE basics. A classic MoE consists of a gating network and multiple expert networks. The gate outputs weights for each expert, and the final output is a weighted sum of expert predictions. Top‑K gating selects the K highest‑weight experts, reducing computation.
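The Top‑K gating described above can be sketched in a few lines. This is a minimal illustrative implementation, not the authors' code: the linear gate and linear experts, the shapes, and the renormalization over the selected experts are all assumptions chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def top_k_moe(x, gate_w, expert_ws, k=2):
    """x: (d,) input; gate_w: (n_experts, d); expert_ws: list of (out, d) matrices."""
    weights = softmax(gate_w @ x)              # one gate weight per expert
    top = np.argsort(weights)[-k:]             # indices of the K largest weights
    w_top = weights[top] / weights[top].sum()  # renormalize over selected experts
    # Only the K selected experts are evaluated -- this is the computational saving.
    return sum(w * (expert_ws[i] @ x) for w, i in zip(w_top, top))

d, n_experts, out_dim = 8, 4, 3
x = rng.normal(size=d)
gate_w = rng.normal(size=(n_experts, d))
expert_ws = [rng.normal(size=(out_dim, d)) for _ in range(n_experts)]
y = top_k_moe(x, gate_w, expert_ws, k=2)
```

With k equal to the number of experts this reduces to the classic dense MoE; smaller k trades a sparser expert mixture for lower compute.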
Proposed multi‑category MoE. Leveraging JD’s hierarchical category tree, the authors extend Top‑K MoE with two enhancements:
Hierarchical Soft Constraint (HSC): an additional gate takes the sample's top‑level category as input; its output is aligned (via L2 distance) with the main MoE gate weights, encouraging samples from the same top‑level category to activate similar experts.
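A minimal sketch of the HSC term: the L2 alignment between the two gate distributions follows the description above, while the gate parameterizations, shapes, and one‑hot category encoding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: 4 experts, 5 top-level categories, feature dim 8.
n_experts, n_cats, d = 4, 5, 8
main_gate_w = rng.normal(size=(n_experts, d))      # main MoE gate
cat_gate_w = rng.normal(size=(n_experts, n_cats))  # extra category-only gate

x = rng.normal(size=d)
cat_onehot = np.eye(n_cats)[2]            # the sample's top-level category

p_main = softmax(main_gate_w @ x)         # gate weights from full features
p_cat = softmax(cat_gate_w @ cat_onehot)  # gate weights from category alone

# HSC term: squared L2 distance between the two gate distributions.
# Samples sharing a top-level category share p_cat, so minimizing this
# pulls their main gate weights toward a common pattern.
hsc_loss = float(np.sum((p_main - p_cat) ** 2))
```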
Adversarial Regularization: for each sample, a randomly chosen non‑activated expert is forced to produce predictions that differ from those of the activated experts, increasing diversity among experts.
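The adversarial term can be sketched as follows. Sampling one non‑activated expert per example follows the text; the specific penalty (negative squared distance to the activated experts' mean prediction) is an assumed form standing in for the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(2)

def adversarial_reg(expert_outputs, active_idx, rng):
    """expert_outputs: (n_experts, out) predictions from every expert.
    active_idx: set of experts selected by Top-K gating for this sample.
    Picks one random non-activated expert and returns a penalty that is
    minimized when its prediction moves AWAY from the activated experts'
    mean prediction (assumed penalty form, for illustration)."""
    n = expert_outputs.shape[0]
    inactive = [i for i in range(n) if i not in active_idx]
    j = rng.choice(inactive)
    active_mean = expert_outputs[list(active_idx)].mean(axis=0)
    # Negative squared distance: gradient descent on this term widens the gap.
    return -float(np.sum((expert_outputs[j] - active_mean) ** 2))

expert_outputs = rng.normal(size=(4, 3))
reg = adversarial_reg(expert_outputs, active_idx={0, 2}, rng=rng)
```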
The overall training loss combines the original MoE loss, the HSC loss, and the adversarial regularization term.
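Under the description above, the total objective plausibly takes a weighted‑sum form (the trade‑off coefficients $\lambda_1, \lambda_2$ are assumed hyperparameters, not stated in the article):

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MoE}} + \lambda_1 \, \mathcal{L}_{\text{HSC}} + \lambda_2 \, \mathcal{L}_{\text{adv}}
$$

where $\mathcal{L}_{\text{MoE}}$ is the original ranking loss, $\mathcal{L}_{\text{HSC}}$ the L2 gate‑alignment term, and $\mathcal{L}_{\text{adv}}$ the adversarial diversity term.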
Experiments. The model was evaluated on the public Amazon dataset and JD’s internal dataset, using DNN and a standard MoE as baselines. Variants with only HSC (HSC‑MoE) or only adversarial regularization (Adv‑MoE) were also tested. Results show consistent AUC and NDCG improvements, with statistically significant p‑values.
Further analysis of gate outputs using t‑SNE demonstrates clearer clustering of similar sub‑categories and better separation of dissimilar ones, confirming that the hierarchical constraint guides the model to activate appropriate experts.
Performance gains across categories with varying sample sizes were also reported, showing larger relative improvements for low‑frequency categories.
For full methodological details, see the authors’ ICDE 2021 paper: Xiao, Zhuojian et al., “Adversarial Mixture Of Experts with Category Hierarchy Soft Constraint.”
Reference: Shazeer et al., “Outrageously large neural networks: The sparsely‑gated mixture‑of‑experts layer,” arXiv:1701.06538 (2017).
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.