
Two-Stage Constrained Actor-Critic for Short‑Video Recommendation and a Reinforcement‑Learning Multi‑Task Framework

This article presents a two‑stage constrained actor‑critic (TSCAC) algorithm that models short‑video recommendation as a constrained reinforcement‑learning problem, details its theoretical formulation and optimization loss, and demonstrates its effectiveness through extensive offline and online experiments. It then introduces a reinforcement‑learning multi‑task framework (RMTL) that further improves multi‑objective recommendation performance.

DataFunSummit

The talk introduces three topics: (1) a two‑stage constrained reinforcement‑learning algorithm for short‑video recommendation, (2) a reinforcement‑learning based multi‑task recommendation framework, and (3) a Q&A session.

Problem Modeling – Short‑video recommendation is cast as a constrained Markov Decision Process (CMDP): the primary objective is to maximize total watch time, subject to constraints on interaction metrics (likes, comments, shares, etc.). Existing constrained RL methods fit poorly here because a single critic tends to be dominated by the dense watch‑time reward, and searching Lagrange multipliers for multiple constraints is prohibitively expensive.
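The CMDP described above can be written compactly as a constrained objective. This is a standard formulation sketch consistent with the talk's description; the symbols (watch‑time reward \(r^{\text{main}}\), auxiliary rewards \(r^{(i)}\), constraint thresholds \(C_i\)) are notational assumptions, not taken verbatim from the paper:

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, r^{\text{main}}_{t}\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, r^{(i)}_{t}\right] \ge C_{i},
\qquad i = 1, \dots, m
```

Here \(\pi\) is the recommendation policy and each constraint keeps the expected return of one interaction signal above its threshold.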

Two‑Stage Constrained Actor‑Critic (TSCAC) – In the first stage, separate policies are learned for each auxiliary interaction signal using dedicated critics. In the second stage, a main policy is optimized for watch time while being softly constrained to stay close to the auxiliary policies; the optimal dual solution is derived analytically and a new KL‑based loss is introduced. Offline evaluation on the KuaiRand dataset and online A/B tests on the Kuaishou app show that TSCAC outperforms Pareto optimization, RCPO, and baseline ranking models.
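To make the second stage concrete, here is a minimal sketch of a KL‑regularized stage‑2 loss: a policy‑gradient term for the watch‑time advantage plus a KL penalty pulling the main policy toward each stage‑1 auxiliary policy. Function names, the discrete‑action setting, and the per‑constraint weights are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over action logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def stage2_loss(main_logits, aux_policies, advantage, action, kl_weights):
    """Illustrative TSCAC stage-2 objective: maximize the watch-time
    advantage while softly constraining the main policy to stay close
    (in KL divergence) to each auxiliary policy from stage 1."""
    pi = softmax(main_logits)
    # Policy-gradient term for the action actually taken.
    pg_loss = -advantage * np.log(pi[action])
    # Soft constraints: weighted KL(pi_main || pi_aux_i) per auxiliary policy.
    kl_loss = 0.0
    for w, pi_aux in zip(kl_weights, aux_policies):
        kl_loss += w * np.sum(pi * np.log(pi / pi_aux))
    return pg_loss + kl_loss
```

When an auxiliary policy coincides with the main policy, its KL term vanishes and only the watch‑time gradient remains, which matches the intuition that the constraints only activate when the main policy drifts from the interaction‑oriented policies.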

Offline Experiments – TSCAC is compared with Behavior Cloning (Wide&Deep), DeepFM, RCPO, and Pareto‑based methods. Results demonstrate significant improvements in watch time as well as interaction metrics (click, like, comment).

Online Experiments – An online A/B test against a Learning‑to‑Rank baseline shows TSCAC delivering a statistically significant 0.1% gain in watch time, along with better auxiliary metrics than RCPO and an interaction‑only actor‑critic variant.

Reinforcement‑Learning Multi‑Task Learning (RMTL) Framework – To address multi‑task recommendation, RMTL builds a session‑level MDP, uses a multi‑critic architecture, and dynamically adjusts task‑specific loss weights via the critic’s Q‑values. Experiments on KuaiRand and RetailRocket datasets reveal that RMTL consistently improves AUC, log‑loss, and s‑log‑loss over state‑of‑the‑art multi‑task models, and the pretrained critics transfer well across different MTL backbones.
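The core mechanism of RMTL is reweighting each task's loss by its critic's Q‑value. The sketch below shows one plausible version, where a higher Q‑value shrinks that task's weight so the optimizer focuses on state‑action pairs the critic currently values less. The function name and the exact `(1 - Q)` weighting are assumptions for illustration, not the paper's code:

```python
import numpy as np

def rmtl_weighted_loss(preds, labels, q_values):
    """Illustrative RMTL-style objective: per-task binary cross-entropy,
    with each task's loss reweighted by (1 - Q), where Q is the
    corresponding critic's value estimate in [0, 1]."""
    total = 0.0
    for p, y, q in zip(preds, labels, q_values):
        # Standard binary cross-entropy for this task's prediction.
        bce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
        # Critic-driven weight: confident (high-Q) tasks contribute less.
        total += (1.0 - q) * bce
    return total
```

Because the weights come from pretrained critics rather than the backbone itself, the same weighting can be dropped onto different MTL backbones, which is consistent with the transfer result reported above.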

Conclusions – Constrained RL and RL‑based multi‑task learning provide effective solutions for jointly optimizing primary and auxiliary objectives in recommendation systems, with practical benefits demonstrated both offline and in production environments.

multi-task learning · recommendation systems · reinforcement learning · constrained optimization · online A/B testing
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
