Reinforcement Learning and Multi‑Task Recommendation: Two‑Stage Constrained Actor‑Critic and Multi‑Task RL Approaches at Kuaishou
This talk presents Kuaishou's research on combining reinforcement learning with multi‑task recommendation, detailing a two‑stage constrained actor‑critic method for short‑video ranking, a multi‑task RL framework, experimental results on offline and online systems, and practical Q&A insights.
The presentation introduces two research works from Kuaishou that explore the integration of reinforcement learning (RL) with multi‑task recommendation for short‑video platforms.
Work 1 – Two‑Stage Constrained Actor‑Critic for Short Video Recommendation: The authors describe a constrained multi‑task scenario where watch time (a dense, continuous signal) is treated as the primary objective and interaction metrics such as likes, follows, and comments serve as auxiliary objectives. By formulating the problem as a Lagrangian dual with a main utility and lower‑bound constraints, they avoid the typical Pareto trade‑off and prioritize the primary goal. The solution splits optimization into two stages: first, an actor‑critic optimizes the auxiliary tasks using separate critics for each interaction type; second, the main objective (watch time) is optimized with a weighted actor‑critic that incorporates the auxiliary constraints via a closed‑form solution derived from completing the square.
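The two-stage weighting described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the function names, the linear combination of advantages, and the dual-ascent step for the multipliers are all assumptions made for clarity.

```python
import numpy as np

def constrained_advantage(adv_main, adv_aux, lambdas):
    """Stage 2 (illustrative): combine the watch-time (main) advantage
    with the auxiliary interaction advantages via Lagrange multipliers."""
    lambdas = np.asarray(lambdas, dtype=float)
    adv_aux = np.asarray(adv_aux, dtype=float)
    return float(adv_main + lambdas @ adv_aux)

def update_lambdas(lambdas, constraint_gaps, lr=0.01):
    """Dual ascent on the Lagrangian: a multiplier grows while its
    auxiliary lower-bound constraint is violated (gap > 0), and is
    projected back to zero otherwise."""
    lambdas = np.asarray(lambdas, dtype=float)
    gaps = np.asarray(constraint_gaps, dtype=float)
    return np.maximum(0.0, lambdas + lr * gaps)
```

In this toy form, each auxiliary critic contributes its advantage to the main actor's update in proportion to its multiplier, and the multipliers themselves are tuned by how far each interaction metric sits from its lower bound.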
Experimental results on offline datasets show that the two‑stage actor‑critic outperforms baselines on watch time while preserving interaction metrics. Online A/B tests confirm that the method improves watch time and respects interaction constraints better than previous ranking‑based strategies.
Work 2 – Multi‑Task Recommendations with Reinforcement Learning: This collaborative project with City University of Hong Kong revisits traditional multi‑task learning (MTL) for jointly optimizing click‑through rate (CTR) and conversion rate (CVR). The authors propose a Reinforced Multi‑Task Learning (RMTL) framework that adjusts per‑task loss weights using long‑term value estimates from a Markov decision process (MDP) formulation. Built on an ESMM‑style backbone, the framework assigns each task its own actor and critic, and applies soft target updates and Q‑function learning to stabilize training.
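The critic-driven reweighting and the soft update mentioned above can be sketched as follows. This is a simplified illustration: the specific weighting form `w_k = 1 - Q_k` (with `Q_k` assumed to lie in [0, 1]) and the function names are assumptions, not the paper's exact equations.

```python
def rmtl_weighted_loss(task_losses, q_values):
    """Down-weight tasks whose long-term value estimate is already high,
    so shared capacity shifts toward the weaker tasks (illustrative)."""
    return sum((1.0 - q) * loss for loss, q in zip(task_losses, q_values))

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging of critic parameters toward the online network,
    a standard trick to stabilize Q-learning."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

A small `tau` keeps the target critic slowly moving, which damps the feedback loop between the Q-function and the loss weights it produces.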
Extensive experiments on two public datasets show that RMTL consistently improves over baselines such as ESMM, MMoE, and PLE; among the baselines, PLE performs best with shared embeddings, while ESMM is strongest on CVR for the KuaiRand dataset. Transferability studies show that a learned critic can be grafted onto other backbone models to boost their performance, and ablation studies confirm the effectiveness of the proposed weighting scheme.
Conclusion and Practical Insights: The authors summarize key lessons: RL combined with multi‑task optimization is well‑suited for long‑term recommendation goals; soft regularization can enforce auxiliary constraints; data quality, label accuracy, and model supervision are critical; and handling sparse signals may require proxy metrics or indirect optimization. The Q&A session addresses loss functions for watch time, handling sparse interaction signals, multi‑objective tuning via RL, and strategies for cold‑start users.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.