
Reinforcement Learning for E‑commerce Search Ranking: RNN User State Modeling and DDPG Long‑Term Value Optimization

This presentation details how JD applied reinforcement learning—using RNN‑based user state modeling and a DDPG framework—to improve e‑commerce search ranking by optimizing long‑term cumulative value, showing significant offline and online gains in conversion and GMV.

DataFunSummit

Presented by JD algorithm engineer Miao Da‑Dong and organized by DataFunTalk, the talk introduces a reinforcement‑learning solution for e‑commerce search ranking that maximizes long‑term cumulative reward rather than only immediate click or conversion signals.

1. Search Ranking Scenario and Algorithm Overview

In typical e‑commerce search, the pipeline consists of recall, coarse ranking, fine ranking, re‑ranking and mixing, with the optimization goal of increasing user conversion. Traditional supervised training optimizes immediate feedback at each iteration, ignoring the fact that user state evolves with each interaction. Reinforcement learning is used to model the interactive process between user and search system to capture long‑term value, and the solution has been deployed at JD at scale.

2. Reinforcement Learning Modeling Process

The work was published at CIKM 2021. The RL objective is to maximize the expected long‑term value Q, defined as the immediate reward plus the discounted sum of future rewards. In the search setting, a user request triggers the ranking engine to select an action (a ranking score for each candidate item) based on the current user state; the user’s feedback (click, purchase) updates the state and supplies the reward for the next iteration.
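The long‑term value described above can be sketched as a discounted cumulative return. A minimal illustration (the discount factor and reward values here are made up for the example):

```python
def discounted_return(rewards, gamma=0.9):
    """Long-term value: immediate reward plus discounted future rewards.

    Q = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    """
    q = 0.0
    for r in reversed(rewards):  # fold from the last step backwards
        q = r + gamma * q
    return q

# e.g. feedback signals over three successive interactions
q = discounted_return([1.0, 0.0, 2.0], gamma=0.5)  # 1 + 0.5*(0 + 0.5*2) = 1.5
```

A supervised ranker would, in effect, optimize only the first term of this sum; the RL formulation credits actions for the discounted tail as well.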

The industrial solution differs from prior work (e.g., Alibaba) by modeling not only intra‑session page turns but also multi‑session user decision processes, thus capturing the cumulative effect of a user’s entire search journey.

State modeling: use an RNN to represent user state and its transition.

Long‑term value modeling: apply DDPG on top of the RNN state.

3. RNN‑Based User State Transition Modeling

The baseline model is DIN, which captures relations between target items and historical behavior but cannot model sequential state changes. To capture real‑time user state updates, three layers are designed:

Data layer: sequential user query data.

Model layer: an RNN (GRU) processes each session; hidden states are passed to the next timestep, enabling sequential state representation.

Architecture layer: a real‑time incremental path updates user state online.
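The model‑layer idea — carrying a hidden state across a user's behavior sequence — can be sketched with a single‑unit GRU cell. This is a toy scalar version (the real model uses vector states and learned weights; the parameter values here are illustrative only):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x, p):
    """One GRU update for a single hidden unit, with toy weights in dict p."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)                 # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)                 # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev))    # candidate state
    return (1.0 - z) * h_prev + z * h_cand                      # interpolate

params = {"wz": 0.5, "uz": 0.1, "wr": 0.4, "ur": 0.2, "wh": 0.8, "uh": 0.3}
h = 0.0
for x in [1.0, 0.5, -0.2]:  # one session's behavior signals, in order
    h = gru_step(h, x, params)
```

The key property is that each step consumes the previous hidden state, which is what allows the architecture layer to cache the state online and update it incrementally per request.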

Training samples are constructed by grouping all items shown in a search session; padding is used to align sequence lengths, and short‑session users are filtered and concatenated to reduce unnecessary padding, achieving a three‑fold speedup.
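The padding‑reduction trick can be sketched as greedy packing: instead of padding every short session to the maximum length, short sessions share a row. A hypothetical sketch (function names and the packing heuristic are assumptions, not the talk's exact implementation):

```python
PAD = 0

def pad_to(items, max_len):
    """Right-pad (or truncate) an item-id list to a fixed length."""
    return (items + [PAD] * max_len)[:max_len]

def pack_short_sessions(sessions, max_len):
    """Greedily concatenate short sessions into shared rows to cut padding."""
    rows, current = [], []
    for s in sorted(sessions, key=len):
        if len(current) + len(s) <= max_len:
            current.extend(s)
        else:
            rows.append(pad_to(current, max_len))
            current = list(s[:max_len])
    if current:
        rows.append(pad_to(current, max_len))
    return rows

naive = [pad_to(s, 8) for s in [[1, 2], [3], [4, 5, 6]]]   # 3 mostly-pad rows
packed = pack_short_sessions([[1, 2], [3], [4, 5, 6]], 8)  # 1 dense row
```

Fewer, denser rows mean fewer wasted PAD computations per batch, which is the source of the reported speedup.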

Offline evaluation shows a 0.58 % AUC@10 lift over the DIN baseline, and online A/B tests confirm significant improvements in conversion rate and GMV.

4. DDPG‑Based Long‑Term Value Modeling

The pipeline defines the action space (ranking scores), uses the RNN‑derived user state, designs a reward based on the score difference between positive and negative samples, and selects an actor‑critic algorithm (DDPG) for training. Three reward functions were evaluated: constant, cross‑entropy, and sigmoid. The sigmoid reward yielded the best stability and a 0.16 % metric gain.
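The sigmoid reward can be sketched as a smooth function of the positive‑negative score gap. The exact functional form and scale used in the talk are not given, so this is an assumed shape:

```python
import math

def sigmoid_reward(pos_score, neg_score, scale=1.0):
    """Smooth reward in (0, 1) that grows with the positive-negative gap."""
    return 1.0 / (1.0 + math.exp(-scale * (pos_score - neg_score)))

def constant_reward(pos_score, neg_score):
    """Baseline for comparison: fixed reward regardless of the gap."""
    return 1.0
```

Unlike the constant reward, the sigmoid variant gives a larger signal the better the model separates positive from negative samples, while saturating at the extremes, which is consistent with the stability the talk reports.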

The DDPG architecture consists of:

State network: RNN for user state transition.

Actor network: takes the state vector and outputs a continuous scoring action.

Critic network: evaluates the action‑state pair to produce the estimated long‑term value, guiding the actor.
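The three networks above can be sketched as a forward pass with toy linear layers. All weights and dimensions here are hand‑picked placeholders; the real model's architecture and parameters are not specified in the talk:

```python
import math

def linear(vec, weights, bias):
    return sum(v * w for v, w in zip(vec, weights)) + bias

def actor(state, w, b):
    """Maps the RNN-derived state vector to a continuous ranking score in (-1, 1)."""
    return math.tanh(linear(state, w, b))

def critic(state, action, w, b):
    """Scores the (state, action) pair as an estimated long-term value Q."""
    return linear(state + [action], w, b)

# toy state vector (stand-in for the RNN hidden state) and illustrative weights
state = [0.2, -0.1, 0.5]
action = actor(state, [0.3, 0.1, -0.2], 0.0)
q_value = critic(state, action, [0.1, 0.2, 0.1, 0.5], 0.0)
```

During training the critic's Q estimate provides the gradient signal for the actor, which is what lets a continuous scoring action be learned without enumerating discrete rankings.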

Two loss functions are combined: a policy‑gradient loss for the actor and a temporal‑difference loss for the critic. Proper weighting is essential for convergence; experiments show that extreme weights (0 or 1) cause divergence.
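One way to read the weighting finding is as a convex combination of the two losses; the weighted‑sum form below is an assumption, but it makes the divergence condition at the extremes explicit:

```python
def combined_loss(actor_loss, critic_loss, w):
    """Convex combination of the policy-gradient (actor) and TD (critic) losses.

    The talk reports divergence at the extreme weights, so w must stay
    strictly between 0 and 1.
    """
    if not 0.0 < w < 1.0:
        raise ValueError("extreme weights (0 or 1) were observed to diverge")
    return w * actor_loss + (1.0 - w) * critic_loss
```

With w = 0 the actor receives no gradient at all, and with w = 1 the critic's value estimates are never corrected, so both degenerate cases starve one half of the actor‑critic loop.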

DDPG improves Session AUC@10 by 0.83 % offline and yields >1 % lift in conversion and GMV online. Stability analysis over two weeks shows initial metric oscillation followed by consistent superiority over the RNN baseline, especially for users with many historical searches.

5. Planning and Outlook

Future work includes moving from offline‑only RL to fully online on‑policy learning for more responsive policy updates, and exploring mixed‑ranking of JD main‑site and LBS (hour‑buy) items to maximize overall revenue.

6. Q&A Highlights

Q: What is included in the dump feature? A: All item features used for the current request plus the RNN hidden state for incremental online updates.

Q: Have you tried offline RL? A: Yes, the presented model is an offline RL approach that trains on historical data, updates daily, and performs online incremental state updates.

Q: How is the model deployed and what is its performance? A: Deployed via JD’s Predictor service; RNN state is cached, so each forward pass is an incremental computation with negligible latency.

Q: Is the environment static during training? A: Yes, training uses static historical sessions; online serving provides real‑time feedback for incremental updates.

Q: Why use a continuous reward function? A: Because actions (ranking scores) are continuous; a reward that grows with the score gap encourages better discrimination and stabilizes training.

Q: Will the online model explore? A: Current online models are offline‑trained and updated daily without exploration; future on‑policy RL will incorporate safe exploration mechanisms.

Thank you for attending.

E-commerce · Reinforcement learning · Search ranking · User modeling · RNN · DDPG
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
