
Advanced Practices in E‑commerce Recommendation: Multi‑Objective Optimization, User Behavior Sequence Modeling, Fine‑Grained Behavior Modeling, and Multimodal Features

The article presents JD's end‑to‑end recommendation pipeline, detailing the four‑stage ranking chain, challenges of fine‑ranking, and practical solutions including multi‑objective learning, transformer‑based user behavior sequence modeling, fine‑grained click behavior integration, and multimodal image features, with offline and online performance gains.


Speaker: Wang Dongyue, JD; editor: Wu Qiyao; source: DataFunTalk.

Overview: JD's recommendation system covers the full user journey—pre‑purchase (feeds, "My JD"), purchase (cart, detail page), and post‑purchase (order page)—with DAU in the tens of millions. Optimization goals differ across these scenarios.

Ranking Pipeline : The system consists of four modules—recall, coarse ranking, fine ranking, and re‑ranking—each handling decreasing candidate volumes (billions → tens of thousands → hundreds → tens). Fine‑ranking faces three main challenges: multi‑objective optimization, diverse user expression, and rich item expression.

To address these, JD employs multi‑objective learning, user behavior sequence modeling, and multimodal features, with domain‑specific enhancements.

01 Business & Scenarios : JD recommends across the entire purchase funnel, optimizing click‑through, browsing depth, conversion, and cross‑category clicks depending on the stage.

02 Multi‑Objective Optimization : JD optimizes ~6–8 objectives (click, conversion, GMV, etc.). A traditional multi‑objective model combines target predictions via a static combination layer. JD introduces a personalized fusion network (gate‑based) that learns dynamic weights per user state, allowing the model to prioritize conversion for buying‑intent users and clicks for browsing users.

The fusion network is trained alternately with the main MMoE network (every 10 batches) and uses gradient blocking to keep the main network unaffected. Online experiments show higher fusion weights for conversion in order sessions and higher click weights in browsing sessions, yielding significant CTR and conversion lifts across feeds, live streams, and cart models.
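The gate idea above can be illustrated with a minimal NumPy sketch. All names and shapes here are hypothetical (the talk does not give the architecture in this detail): a small linear gate maps a user-state vector to softmax weights over the per-objective predictions, so the final ranking score is a convex combination whose weights vary per user.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fused_score(user_state, task_preds, gate_W, gate_b):
    """Personalized fusion sketch (hypothetical shapes):
    user_state (d,)   -- user context features, e.g. session intent signals
    task_preds (K,)   -- per-objective predictions (click, conversion, ...)
    gate_W (K, d), gate_b (K,) -- parameters of the small fusion gate.
    In training, the gate would see a stop-gradient copy of the main
    network's outputs ("gradient blocking"), so the MMoE tower is unaffected.
    """
    logits = gate_W @ user_state + gate_b   # gate logits from user context
    weights = softmax(logits)               # dynamic per-user weights, sum to 1
    return float(weights @ task_preds), weights
```

For a buying-intent user the learned gate would push weight toward the conversion prediction; for a browsing user, toward the click prediction—matching the online observation reported above.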

03 User Behavior Sequence Modeling : The goal is to extract user preferences from raw click/order sequences. Challenges include diverse interests, temporal decay, noisy actions, and latency constraints. JD adopts a Transformer‑based encoder‑decoder with added time‑difference (recency) encoding alongside positional encoding. The encoder captures first‑ and second‑order interest vectors; the decoder applies multi‑head target attention to align these interests with candidate items.

Engineering optimizations (batch‑wise encoder sharing, parallel inference) reduced online latency from ~73 ms to 33 ms (tp99). Offline analysis confirms the model learns stronger attention for more recent behaviors, leading to a 3.5 % CTR increase on the homepage feed.

04 Fine‑Grained Behavior Modeling : Post‑click signals (dwell time, operation count, main‑image view, comment browse) are encoded via an MLP into a fixed‑length vector and combined with temporal‑spatial encodings by element‑wise ("bitwise") addition. This enriches the user interest representation, improving homepage feed CTR by 2 % and conversion by 1.7 %.
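The encoding step can be sketched as follows; layer sizes and parameter names are illustrative, not from the talk:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def micro_behavior_vector(signals, W1, b1, W2, b2, temporal_enc):
    """Encode post-click signals with a two-layer MLP, then add the
    temporal-spatial encoding element-wise (the "bitwise add" in the text).
    signals (s,): e.g. [dwell_seconds, op_count, saw_main_image, browsed_comments]
    W1 (h, s), b1 (h,), W2 (o, h), b2 (o,), temporal_enc (o,) -> (o,)."""
    h = relu(W1 @ signals + b1)        # hidden layer over raw micro-signals
    v = W2 @ h + b2                    # fixed-length behavior vector
    return v + temporal_enc            # fuse with temporal-spatial encoding
```

The resulting vector then replaces (or augments) the plain item embedding at each position of the behavior sequence, so the sequence model sees not just *what* was clicked but *how* it was engaged with.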

05 Multimodal Features : To mitigate cold‑start and enrich item representation, JD fine‑tunes a ResNet‑101 image model on product‑word classification (≈100 k categories), extracting a 64‑dim image embedding. A transfer layer with a multi‑gate‑multi‑expert structure aligns image embeddings with other features based on item category. Deployed on homepage feeds and core channels, this yields 2.6 % and 5 % CTR lifts respectively.
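The category-conditioned transfer layer can be sketched in NumPy as a multi-gate-multi-expert mixture; expert count, dimensions, and names are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def transfer_layer(img_emb, category_onehot, experts, gate_W):
    """Category-gated expert mixture aligning the image embedding with
    the rest of the feature space (hypothetical shapes):
    img_emb (64,)          -- embedding from the fine-tuned image model
    category_onehot (C,)   -- item category indicator
    experts: list of (out_dim, 64) projection matrices
    gate_W (n_experts, C)  -- gate parameters; the category picks the mix."""
    gate = softmax(gate_W @ category_onehot)           # per-category expert weights
    outs = np.stack([E @ img_emb for E in experts])    # (n_experts, out_dim)
    return gate @ outs                                 # weighted expert combination
```

Gating on category lets visually distinct verticals (e.g. apparel vs. electronics) use different projections of the same 64-dim image embedding, instead of one shared linear map.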

Q&A Highlights : Online multi‑objective models typically handle six goals; weights are balanced via task‑specific scaling. The fusion network outputs per‑goal weights that are multiplied with predictions before ranking. Image embedding fine‑tuning freezes all but the final classification layer, extracting a 64‑dim vector for each product image. Joint multimodal training is avoided due to complexity and service impact. Two separate loss functions optimize the main MMoE network and the fusion network independently.

Thank you for listening.

Tags: e-commerce, transformer, user behavior modeling, recommendation systems, multi-objective optimization, multimodal features, fine-grained behavior
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
