Advanced Practices in E‑commerce Recommendation: Multi‑Objective Ranking, User Behavior Sequence Modeling, Fine‑Grained Behavior Modeling, and Multimodal Features
This article presents JD's e‑commerce recommendation system: its four‑stage ranking pipeline, multi‑objective optimization with personalized fusion, Transformer‑based user behavior sequence modeling, fine‑grained behavior modeling, and multimodal feature integration, along with experimental results and engineering optimizations.
JD's recommendation platform covers the entire user journey—pre‑purchase (feeds, "My JD"), purchase (cart, product detail), and post‑purchase (order page)—with daily active users in the tens of millions, requiring different optimization goals for each scenario.
The ranking pipeline consists of recall, coarse ranking, fine ranking, and re‑ranking, each handling decreasing candidate volumes (billions → tens of thousands → hundreds → tens) and focusing on distinct objectives.
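The funnel above can be sketched as a cascade in which each stage scores the surviving candidates with a progressively more expensive model and keeps only the top‑k. This is a minimal illustrative sketch, not JD's implementation; the stage budgets and the placeholder scorers are assumptions.

```python
import random

# Hypothetical cascade: each stage scores the surviving candidates and keeps
# the top-k for the next, mirroring recall -> coarse -> fine -> re-rank.
STAGE_BUDGETS = [10_000, 500, 50, 10]  # illustrative per-stage keep counts

def run_cascade(candidates, scorers, budgets=STAGE_BUDGETS):
    """candidates: list of item ids; scorers: one scoring fn per stage."""
    pool = candidates
    for scorer, k in zip(scorers, budgets):
        pool = sorted(pool, key=scorer, reverse=True)[:k]
    return pool

random.seed(0)
items = list(range(100_000))
# Placeholder scorers; a real system would use i2i/u2i recall, a dual-tower
# coarse ranker, a feature-rich fine ranker, and a diversity-aware re-ranker.
scorers = [lambda i: random.random() for _ in STAGE_BUDGETS]
final = run_cascade(items, scorers)
print(len(final))  # 10
```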
Fine‑ranking faces three main challenges: multi‑objective optimization (click‑through, conversion, GMV, depth, diversity), diverse user intent expression, and rich product attributes (brand, category, images, titles) that demand multimodal modeling.
To address these, JD employs multi‑objective learning, user behavior sequence modeling, and multimodal features, with specific innovations for e‑commerce.
01 Business & Scenarios
The recall stage uses i2i and u2i methods, coarse ranking employs a dual‑tower model for accuracy and latency, fine ranking adds complex features and model structures, and re‑ranking incorporates diversity and novelty.
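The dual‑tower structure mentioned for coarse ranking can be sketched as follows: user and item features pass through separate MLP towers, and the score is an inner product, which is what makes it cheap enough for large candidate sets (item vectors can be precomputed offline). Layer sizes and the single hidden layer are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal dual-tower sketch (dimensions are illustrative, not JD's):
# separate towers embed user and item features into a shared space;
# the coarse-ranking score is a simple inner product.
def tower(x, w1, w2):
    h = np.maximum(x @ w1, 0.0)  # one hidden ReLU layer
    h = h @ w2
    return h / np.linalg.norm(h, axis=-1, keepdims=True)  # L2-normalize

d_user, d_item, d_hid, d_out = 32, 24, 64, 16
uw1, uw2 = rng.normal(size=(d_user, d_hid)), rng.normal(size=(d_hid, d_out))
iw1, iw2 = rng.normal(size=(d_item, d_hid)), rng.normal(size=(d_hid, d_out))

user = tower(rng.normal(size=(1, d_user)), uw1, uw2)        # (1, 16)
items = tower(rng.normal(size=(10_000, d_item)), iw1, iw2)  # (10000, 16)
scores = (user @ items.T).ravel()  # one dot product per candidate
top = np.argsort(-scores)[:500]    # keep top-500 for fine ranking
print(top.shape)
```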
02 Multi‑Objective Optimization
Online, JD optimizes around seven to eight objectives; the presentation focuses on click and conversion. A traditional multi‑objective model combines target predictions via a combination layer, while JD proposes a personalized fusion network that dynamically generates fusion weights based on user state.
The fusion network (MMoE‑based main network plus a gate‑driven fusion network) learns to combine target predictions without affecting the main network's gradients, using alternating training and gradient blocking.
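The fusion idea can be sketched as a forward pass: a gate conditioned on user‑state features emits per‑objective weights, and the final ranking score is the weighted combination of the main network's predictions. In training, the predictions entering the fusion would be gradient‑blocked (e.g. `stop_gradient`/`detach`) so the fusion network cannot disturb the main network; the gate shape and dimensions below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative personalized-fusion forward pass. The predictions entering
# the fusion are treated as constants (gradient-blocked in real training),
# and the gate alone learns per-user fusion weights.
n_objectives, d_state = 2, 16
gate_w = rng.normal(size=(d_state, n_objectives))

user_state = rng.normal(size=(1, d_state))  # e.g. session/intent features
preds = np.array([[0.12, 0.03]])            # pCTR, pCVR from main network

weights = softmax(user_state @ gate_w)      # personalized fusion weights
fused = (weights * preds).sum(axis=-1)      # final ranking score
print(weights.sum(), fused.shape)
```

A purchase‑intent session would push the gate toward a higher conversion weight, reproducing the grid‑search behavior described above without a global, hand‑tuned weight.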
Experiments show the learned fusion weights align with offline grid‑search results, providing higher conversion weights for purchase‑intent sessions and higher click weights for browsing sessions, yielding significant online CTR and conversion improvements across feeds, live streams, and cart ranking.
03 User Behavior Sequence Modeling
The goal is to extract dense user interest vectors from raw click or purchase sequences using a Transformer‑based encoder‑decoder. Challenges include diverse interests, temporal decay, noisy actions, and online latency.
JD enhances the Transformer with temporal encoding (position + recency) and incorporates first‑ and second‑order interest vectors via feature crossing. The decoder applies multi‑head target attention to retrieve item‑relevant preferences.
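The decoder's target attention with a recency signal can be sketched as follows (single‑head for brevity): the candidate item is the query, the behavior sequence supplies keys and values, and a recency embedding is added to each behavior embedding so older actions can be down‑weighted. The bucketing scheme and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Single-head target-attention sketch with an added recency encoding.
seq_len, d = 50, 32
behaviors = rng.normal(size=(seq_len, d))        # clicked-item embeddings
ages_hours = rng.integers(0, 720, size=seq_len)  # time since each action
recency_table = rng.normal(size=(10, d)) * 0.1   # 10 recency buckets (assumed)
recency = recency_table[np.minimum(ages_hours // 72, 9)]
keys = values = behaviors + recency              # temporal encoding added

target_item = rng.normal(size=(1, d))            # candidate being scored
attn = softmax(target_item @ keys.T / np.sqrt(d))  # (1, seq_len)
interest = attn @ values                         # item-relevant interest vector
print(interest.shape)
```

The production model uses multiple heads, so different heads can specialize in different interest facets, which is what the attention‑head analysis mentioned below examines.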
Engineering optimizations reduce online latency from ~73 ms to ~33 ms by sharing encoder outputs across requests and parallelizing graph execution.
Analysis of attention heads confirms the model captures interest evolution, leading to a 3.5 % increase in feed‑page CTR.
04 Fine‑Grained Behavior Modeling
Fine‑grained actions (dwell time, clicks on the main image, comment browsing, etc.) are encoded via an MLP into a fixed‑length vector and combined with temporal encodings by element‑wise addition. Discretized dwell times and action counts are learned end‑to‑end, while categorical fine‑grained actions are ID‑embedded and sum‑pooled.
Online results show a 2 % lift in feed CTR and a 1.7 % lift in conversion.
05 Multimodal Features
Visual and textual modalities (product images and titles) are leveraged to mitigate cold‑start issues. Image embeddings are obtained by fine‑tuning a ResNet‑101 backbone on JD's product‑word classification data, extracting a 64‑dim vector.
A transfer layer with a multi‑gate multi‑expert structure aligns image embeddings with other end‑to‑end features, after which they are concatenated and fed into the Transformer.
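A multi‑gate multi‑expert transfer layer can be sketched as follows: several small expert projections re‑map the frozen 64‑dim image embedding, and gates mix the experts to adapt the visual feature to the space of the end‑to‑end‑trained features. Expert count, gate count, and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative multi-gate multi-expert transfer of a frozen image embedding.
d_img, d_out, n_experts, n_gates = 64, 32, 4, 2
experts = [rng.normal(size=(d_img, d_out)) for _ in range(n_experts)]
gates = rng.normal(size=(n_gates, d_img, n_experts))

img_emb = rng.normal(size=(1, d_img))  # pretrained, frozen 64-dim vector
expert_out = np.stack([np.maximum(img_emb @ w, 0.0) for w in experts], 1)
# expert_out: (1, n_experts, d_out)
transferred = []
for g in gates:
    mix = softmax(img_emb @ g)  # per-gate mixture over experts, (1, n_experts)
    transferred.append((mix[:, :, None] * expert_out).sum(axis=1))
aligned = np.concatenate(transferred, axis=-1)  # (1, n_gates * d_out)
print(aligned.shape)
```

The aligned output is then concatenated with the other features and fed into the Transformer, as described above.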
Offline similarity analysis confirms that visually similar items have higher embedding similarity. Deploying the multimodal model on home‑page feeds and core channels yields 2.6 % and 5 % CTR lifts respectively.
06 Q&A
Q1: How many objectives are used online and how is dominance avoided? A: Six objectives on the home page; each is weighted to balance magnitude.
Q2: Are multiple objective outputs used simultaneously? A: Yes, the personalized multi‑objective model fuses them for ranking.
Q3–Q5 covered further detail on fine‑tuning the multimodal image embeddings, the fusion model's loss design, and why fully joint multimodal training is not adopted (its engineering complexity outweighs the gains).
Thank you for listening.
DataFunSummit