Multi‑Objective Deep Reinforcement Learning Framework for E‑commerce Traffic Allocation (MODRL‑TA)
The article presents a CIKM‑2024 paper that introduces MODRL‑TA, a multi‑objective deep reinforcement learning system combining multi‑objective Q‑learning, a cross‑entropy‑based decision‑fusion algorithm, and a progressive data‑augmentation pipeline to dynamically allocate search traffic on JD.com, with both offline and online experiments showing substantial gains in CTR, CVR, and overall platform performance.
The JD.com search team’s paper, accepted at CIKM 2024, addresses the problem of traffic control in e‑commerce search, where adjusting the post‑ranking position of items reallocates natural traffic to maximize merchant growth, satisfy customer demand, and balance platform interests.
Existing ranking‑learning methods ignore the long‑term value of traffic allocation, while standard reinforcement‑learning approaches struggle to balance multiple objectives and suffer from cold‑start issues in real‑world data. To overcome these challenges, the authors propose a Multi‑Objective Deep Reinforcement Learning framework (MODRL‑TA) consisting of three key components:
Multi‑Objective Q‑Learning (MOQ): Independent deep Q‑network models are trained for each objective (e.g., click‑through rate, conversion rate). Each model estimates the long‑term value of its target and decides the insertion position of a product in the ranked list.
Decision‑Fusion Module (DFM): A cross‑entropy method (CEM) dynamically adjusts the weights of the objectives, allowing the system to respond to time‑varying merchant preferences and to mitigate cold‑start problems.
Progressive Data‑Augmentation (PDA): Initially trains MOQ on simulated offline logs; as real‑world interactions are collected, PDA progressively replaces simulated data with authentic user feedback, smoothing distribution shift and eliminating the cold‑start bottleneck.
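The decision-fusion step can be pictured as a small cross-entropy-method loop over the objective-weight simplex. The sketch below is illustrative only: the score function, population size, and normalization scheme are assumptions standing in for the online business metric the paper optimizes, not the authors' implementation.

```python
import numpy as np

def cem_fuse_weights(score_fn, n_objectives=2, iters=20, pop=64,
                     elite_frac=0.2, seed=0):
    """CEM sketch: search for objective weights that maximize score_fn.

    score_fn is a hypothetical stand-in for merchant/platform feedback;
    it maps a weight vector (summing to 1) to a scalar quality score.
    """
    rng = np.random.default_rng(seed)
    mu = np.full(n_objectives, 1.0 / n_objectives)
    sigma = np.full(n_objectives, 0.5)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = np.abs(rng.normal(mu, sigma, size=(pop, n_objectives)))
        samples /= samples.sum(axis=1, keepdims=True)  # project onto the simplex
        scores = np.array([score_fn(w) for w in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # keep top performers
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

# Hypothetical metric: reward weight vectors near a 70/30 CTR/CVR split.
target = np.array([0.7, 0.3])
best = cem_fuse_weights(lambda w: -np.sum((w - target) ** 2))
```

Because the sampled weights are re-normalized each iteration, the fused weights always form a valid convex combination of objectives, which is what lets the system shift emphasis between CTR and CVR over time.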
The state representation includes user-profile features, query attributes, historical user–item interactions, contextual item features, and aggregated feedback signals. An action is the insertion position of the selected item within the ranked list of length L (a_t ∈ R_L). Rewards are defined per objective, e.g., a higher reward for a higher click probability (CTR objective) or order probability (CVR objective).
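The state described above can be sketched as a simple feature container; the field names and dimensions below are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrafficState:
    """Illustrative state for the traffic-allocation MDP."""
    user_profile: List[float]         # user profile features
    query_features: List[float]       # query attributes
    interaction_history: List[float]  # historical user-item interactions
    item_context: List[float]         # contextual item features
    feedback_signals: List[float]     # aggregated feedback (e.g., recent CTR)

    def to_vector(self) -> List[float]:
        # Concatenate all feature groups into one flat input for the Q-networks.
        return (self.user_profile + self.query_features +
                self.interaction_history + self.item_context +
                self.feedback_signals)

L = 10  # ranked-list length; the action a_t is an insertion position in [0, L)
state = TrafficState([0.1] * 4, [0.2] * 3, [0.3] * 5, [0.4] * 2, [0.5] * 2)
vec = state.to_vector()
```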
Training employs standard DQN loss minimization with separate evaluation and target networks for stability. The overall loss aggregates the individual objective losses, enabling shared representation learning while preserving objective‑specific parameters.
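The training objective can be sketched as one squared TD loss per objective, each bootstrapped from a frozen target network, summed into an overall loss. This is a minimal NumPy sketch under the assumption of uniform aggregation weights; the paper's exact aggregation and network architecture may differ.

```python
import numpy as np

def per_objective_td_loss(q_eval, q_target_next, rewards, gamma=0.99):
    """Squared TD error for one objective (standard DQN loss, sketch).

    q_eval:        Q(s_t, a_t) from the evaluation network for this objective
    q_target_next: max_a Q'(s_{t+1}, a) from the frozen target network
    rewards:       per-objective reward r_t (e.g., click or order signal)
    """
    td_target = rewards + gamma * q_target_next  # bootstrap from target net
    return float(np.mean((td_target - q_eval) ** 2))

def aggregate_loss(per_objective_losses, weights=None):
    # Overall loss as a weighted sum of the objective losses; uniform
    # weights are an assumption made here for illustration.
    if weights is None:
        weights = np.ones(len(per_objective_losses)) / len(per_objective_losses)
    return float(np.dot(weights, per_objective_losses))

# Toy batch: two objectives (CTR, CVR), three transitions each.
ctr_loss = per_objective_td_loss(np.array([0.5, 0.4, 0.6]),
                                 np.array([0.7, 0.6, 0.8]),
                                 np.array([1.0, 0.0, 1.0]))
cvr_loss = per_objective_td_loss(np.array([0.2, 0.1, 0.3]),
                                 np.array([0.3, 0.2, 0.4]),
                                 np.array([0.0, 0.0, 1.0]))
total = aggregate_loss([ctr_loss, cvr_loss])
```

Keeping the per-objective losses separate while summing them for the backward pass is what allows shared lower layers to learn a common representation while each objective head retains its own parameters.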
Extensive offline experiments on JD's main search platform show that MODRL-TA outperforms the MORL-FR baseline, achieving up to a 12.20× CTR reward and a 2.25× CVR reward when using 100% real data. A two-week online A/B test demonstrates up to an 18.0% increase in impressions, a 4.2% rise in CTR, and a 5.1% boost in CVR compared with the PID algorithm, confirming the framework's practical impact on a platform serving over 600 million daily active users.
The authors conclude with future directions, emphasizing the need for finer‑grained algorithm design, stronger computational resources, and robust multi‑objective learning in dynamic, uncertain environments.
Team information and author bios are provided, along with a call for talent to join JD’s search algorithm team.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.