
Real-Time Controllable Multi-Objective Re‑ranking for Taobao Feed

This article presents a comprehensive study of a controllable multi‑objective re‑ranking model for Taobao's information‑flow recommendation, detailing the challenges of complex feed scenarios, three modeling paradigms (V1‑V3), an actor‑critic reinforcement learning framework with hypernet‑generated weights, and extensive online evaluation results.

DataFunTalk

The presentation introduces the challenges faced by Taobao's information‑flow recommendation, such as content diversity, cold‑start exposure, mixed media types, multi‑supply fusion, and numerous business objectives, and explains why traditional pipeline‑style ranking struggles to satisfy these requirements.

It then outlines three re‑ranking modeling paradigms: V1 (context‑aware single‑point scoring), V2 (sequential item selection with state‑based attention), and V3 (reward‑driven reinforcement learning that eliminates the need for explicit labels). Each paradigm’s strengths and limitations are discussed.
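The V2 idea of sequential, context-conditioned selection can be sketched as follows. This is an illustrative greedy loop, not the production model: `scorer` stands in for the state-based attention network, and the toy similarity penalty is an assumption used only to make the behavior visible.

```python
import numpy as np

def sequential_select(scores_fn, items, k):
    """V2-style sequential selection (illustrative): at each step, score the
    remaining candidates conditioned on the items already chosen, then pick
    the highest-scoring one. `scores_fn` stands in for a state-based
    attention scorer."""
    chosen, remaining = [], list(range(len(items)))
    for _ in range(k):
        step_scores = [scores_fn(items[i], [items[j] for j in chosen])
                       for i in remaining]
        best = remaining[int(np.argmax(step_scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy scorer: base relevance minus a penalty for repeating a category,
# so the selected order trades relevance against diversity.
items = [(1.0, 0), (0.9, 0), (0.8, 1)]  # (relevance, category)
def scorer(item, context):
    penalty = sum(0.5 for c in context if c[1] == item[1])
    return item[0] - penalty

print(sequential_select(scorer, items, 3))  # → [0, 2, 1]
```

The second-ranked item changes because, once a category-0 item is chosen, the remaining category-0 candidate is penalized below the category-1 one; this is exactly the context-awareness that single-point scoring (V1) cannot express.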

The core solution is an actor‑critic architecture in which the actor (a DeepSet encoder plus a PointerNet decoder) generates a ranked sequence and the evaluator (the critic) estimates its reward from multiple utilities (click‑through rate, diversity, freshness, etc.). The reward is a weighted sum of these utilities, and the weights \(w\) can be specified in real time.
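The weighted-sum reward is straightforward to state concretely. A minimal sketch, where the utility names and values are illustrative assumptions rather than the paper's actual metrics:

```python
def reward(utilities: dict, weights: dict) -> float:
    """Weighted-sum reward over per-sequence utilities.
    Utility names here are illustrative, not from the paper."""
    return sum(weights[k] * utilities[k] for k in weights)

u = {"ctr": 0.12, "diversity": 0.6, "freshness": 0.8}
w = {"ctr": 1.0, "diversity": 0.3, "freshness": 0.2}
r = reward(u, w)  # 1.0*0.12 + 0.3*0.6 + 0.2*0.8 = 0.46
```

Because the reward is linear in \(w\), changing the weight vector at serving time directly shifts which trade-off between objectives the re-ranker is asked to optimize.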

To enable dynamic weight adjustment without retraining, a hypernetwork predicts weight‑sensitive parameters \(\theta_w\) for the re‑ranking model. During offline training, each sample randomly draws a weight vector from a predefined distribution; the actor generates a sequence, the evaluator computes the reward, and gradients update both the hypernetwork and the weight‑insensitive parts of the model.
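The hypernetwork mechanism can be sketched with a tiny numerical example. Everything here, including the layer sizes and the choice of a linear scoring head, is an assumption made for illustration; the point is only the data flow: the weight vector \(w\) is mapped to head parameters \(\theta_w\), which then score weight-insensitive item encodings.

```python
import numpy as np

rng = np.random.default_rng(0)

def hyper_params(w, H1, H2):
    """Illustrative hypernetwork: map a utility-weight vector w to the
    weight-sensitive parameters theta_w of a linear scoring head."""
    h = np.maximum(H1 @ w, 0.0)      # ReLU hidden layer
    theta = H2 @ h                   # feat_dim + 1 outputs
    return theta[:-1], theta[-1]     # head weights, bias

n_obj, feat_dim, hidden = 3, 8, 16
H1 = rng.normal(size=(hidden, n_obj))        # hypernet parameters
H2 = rng.normal(size=(feat_dim + 1, hidden)) # (trained by the RL gradients)

items = rng.normal(size=(5, feat_dim))  # weight-insensitive item encodings
w = rng.random(n_obj)                   # sampled per training example
head_w, head_b = hyper_params(w, H1, H2)
scores = items @ head_w + head_b        # per-item scores; rank descending
```

During offline training each sample draws a fresh `w`, so the hypernetwork learns the whole family of weight-conditioned scorers at once; at serving time a new business weight vector just produces a new \(\theta_w\) with no retraining.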

Extensive online experiments (A/B tests) demonstrate that the controllable re‑ranking model improves click‑through rate, cold‑start content ratio, shop diversity, group ordering, user dwell time, and content volume compared with the baseline pipeline approach.

The article concludes with a Q&A covering topics such as weight personalization, offline training with sampled weights, handling of mixed media, latency (≈20 ms P99), feature engineering for heterogeneous items, and evaluation metrics (reward and better‑percentage).

Tags: recommendation systems, reinforcement learning, real-time control, multi-objective optimization, re-ranking, hypernetworks
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
