KuaiSim: A Comprehensive User Simulator for Reinforcement Learning in Recommendation Systems
KuaiSim is a comprehensive user simulation environment for recommendation systems that models immediate, long‑term, and cross‑session feedback, supports list‑wise, whole‑session, and retention tasks, provides baselines and evaluation metrics, and demonstrates superior performance on KuaiRand and ML‑1M datasets.
Abstract
Reinforcement‑learning (RL) based recommender systems have attracted attention for their ability to learn policies that maximize long‑term user reward, but deploying RL models online and collecting real A/B data is costly. Simulators offer an offline alternative, yet existing ones suffer from limited feedback types, inconsistency with real data, evaluation challenges, and poor transfer across recommender systems. KuaiSim is a comprehensive user simulator that generates multi‑behavior and cross‑session signals, supports three task levels—list‑wise recommendation, whole‑session sequential recommendation, and cross‑session retention optimization—and provides evaluation methods and baseline algorithms for future benchmarking.
1 Introduction
Deploying RL models directly in live environments is resource‑intensive; simulators enable offline training and evaluation of recommender models. Existing simulators lack long‑term feedback modeling, exhibit data distribution gaps with real logs, and provide limited evaluation tools.
2 KuaiSim Workflow
Figure 1 (right) shows a generic MDP where a recommender receives a user request and produces an action (a list of items). Each request consists of static user profile features and dynamic interaction history. KuaiSim consists of three modules: an instant feedback model, a leave‑session model, and a retention model, producing three feedback types—instant feedback, leave signal, and return‑time signal.
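The request/action/feedback loop above can be sketched as a minimal toy environment. This is an illustrative stand‑in, not the real KuaiSim API: the environment class, method names, and the simple patience heuristic are all assumptions made for the sketch.

```python
import random

class ToySimEnv:
    """Toy stand-in for the simulator loop: each step takes an action
    (a list of items), returns instant feedback per item, and emits a
    leave signal once simulated patience is exhausted."""
    def __init__(self, patience=3.0, seed=0):
        self.rng = random.Random(seed)
        self.patience = patience
        self.history = []  # dynamic interaction history

    def step(self, item_list):
        # Instant feedback: one simulated response score per item.
        responses = [self.rng.random() for _ in item_list]
        reward = sum(responses) / len(item_list)
        # Patience decays each interaction; low reward decays it faster.
        self.patience -= (1.0 - reward)
        self.history.append((item_list, responses))
        return {"reward": reward, "leave": self.patience <= 0.0}

def run_session(env, policy, max_steps=20):
    """Roll out one session until the leave signal fires."""
    total, depth = 0.0, 0
    for _ in range(max_steps):
        feedback = env.step(policy(depth))
        total += feedback["reward"]
        depth += 1
        if feedback["leave"]:
            break
    return total, depth
```

A policy here is any callable mapping the current step to an item list; RL agents plug into the same loop by conditioning on the request state instead.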
2.1 Instant Feedback Module
The Instant Feedback Module (UIRM, the user immediate response module) infers the latent user state from the request and outputs a probability for each instant feedback type (e.g., click, like). It uses an item_correlation function to penalize highly similar items within a list, encouraging diversity. The module is pretrained on log data with a binary cross‑entropy loss for each feedback type.
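A minimal sketch of the similarity penalty: cosine similarity stands in for the paper's item_correlation function, and the function names, penalty weight, and logit interface are illustrative assumptions, not the actual UIRM implementation.

```python
import math

def item_correlation(emb_a, emb_b):
    """Cosine similarity between two item embeddings
    (assumed proxy for the paper's item_correlation)."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    na = math.sqrt(sum(x * x for x in emb_a))
    nb = math.sqrt(sum(x * x for x in emb_b))
    return dot / (na * nb + 1e-8)

def penalized_logits(list_embs, base_logits, penalty=0.5):
    """Subtract each item's mean similarity to the rest of the list
    from its feedback logit, discouraging near-duplicate items."""
    out = []
    for i, logit in enumerate(base_logits):
        sims = [item_correlation(list_embs[i], list_embs[j])
                for j in range(len(list_embs)) if j != i]
        mean_sim = sum(sims) / max(len(sims), 1)
        out.append(logit - penalty * mean_sim)
    return out
```

Passing the penalized logits through a sigmoid per feedback type yields the per‑item probabilities; in pretraining, those probabilities are scored against logged labels with binary cross‑entropy.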
2.2 Leave Module
The Leave Module models user patience: patience decreases with each interaction and triggers a leave signal once it falls below a threshold. The instant reward produced by the UIRM modulates how quickly patience decays, and the patience‑related hyper‑parameters are configurable.
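The threshold mechanism can be sketched as follows; the hyper‑parameter names (initial patience, decay rate, reward weight, threshold) are illustrative placeholders, not the paper's exact parameterization.

```python
class LeaveModule:
    """Sketch of the patience mechanism: patience starts at an initial
    value, decays with each interaction, is partially restored by the
    instant reward, and a leave signal fires once it drops below a
    threshold. All hyper-parameter names are illustrative."""
    def __init__(self, init_patience=1.0, decay=0.2,
                 reward_weight=0.15, threshold=0.0):
        self.patience = init_patience
        self.decay = decay
        self.reward_weight = reward_weight
        self.threshold = threshold

    def update(self, instant_reward):
        """Apply one interaction; return True if the user leaves."""
        self.patience += self.reward_weight * instant_reward - self.decay
        return self.patience < self.threshold
```

Under this parameterization a session's depth is bounded when rewards are low, while consistently high instant rewards can sustain an arbitrarily long session, which is the behavior the module is meant to capture.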
2.3 Retention Module
The Retention Module predicts the time until the user returns, modeled as a geometric distribution. It combines global, personal, and feedback‑dependent retention biases, reflecting activity levels and the effect of recommendation quality on return probability.
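A sketch of the geometric return‑time model: the per‑day return probability is a sigmoid of the summed global, personal, and feedback‑dependent biases. The function names and the exact way the biases combine are assumptions for illustration.

```python
import math
import random

def return_day_probs(global_bias, personal_bias, feedback_bias, max_day=30):
    """Per-day return probability p = sigmoid(sum of biases); the return
    day then follows a geometric distribution P(day=k) = (1-p)^(k-1) * p."""
    p = 1.0 / (1.0 + math.exp(-(global_bias + personal_bias + feedback_bias)))
    probs = [(1.0 - p) ** (k - 1) * p for k in range(1, max_day + 1)]
    return p, probs

def sample_return_day(p, rng):
    """Sample a return day: each day the user returns with probability p."""
    day = 1
    while rng.random() > p:
        day += 1
    return day
```

Because better recommendations raise the feedback‑dependent bias, they raise p and shorten the expected return time (1/p for a geometric distribution), linking recommendation quality to retention.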
3 Benchmark Results and Analysis
3.1 Experimental Setup
Datasets: KuaiRand and ML‑1M. Evaluation: task‑specific metrics for list‑wise, whole‑session, and cross‑session tasks.
List‑wise recommendation (request‑level): L‑reward (average instant reward), Coverage, Intra‑list diversity (ILD).
Sequential recommendation (whole‑session): Whole‑session reward, average reward, Depth (number of interactions before leaving).
Retention optimization (cross‑session): Return time and user retention rate.
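Two of the list‑wise metrics can be computed directly from recommended lists and item embeddings. This is a standard formulation of Coverage and ILD, assumed to match the benchmark's definitions; cosine similarity is the assumed distance base for ILD.

```python
import math
from itertools import combinations

def coverage(recommended_lists, catalog_size):
    """Fraction of the item catalog appearing in any recommended list."""
    seen = {item for lst in recommended_lists for item in lst}
    return len(seen) / catalog_size

def intra_list_diversity(list_embs):
    """ILD: mean pairwise (1 - cosine similarity) over items in a list."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-8)
    pairs = list(combinations(list_embs, 2))
    return sum(1.0 - cos(a, b) for a, b in pairs) / len(pairs)
```

Depth and return time, by contrast, are read directly off the simulator's leave and return signals rather than computed from the lists themselves.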
3.2 Benchmark Results
For list‑wise recommendation, ListCVAE achieves the highest reward and diversity, while PRM performs worst. For whole‑session sequential recommendation, the HAC framework consistently outperforms others, with A2C being the least stable. For retention optimization, RLUR surpasses TD3 and CEM, showing the potential of advanced RL methods.
3.3 Comparison with Existing Simulators
Qualitative analysis shows existing simulators ignore long‑term feedback and support only a single task, whereas KuaiSim satisfies all requirements. Quantitative analysis on KuaiRand demonstrates KuaiSim’s superior fidelity (higher AUC for click prediction) and better agent training performance across depth, average reward, and total reward metrics.
3.4 Data Transferability
KuaiSim was also instantiated on the ML‑1M dataset; benchmark results confirm that HAC remains top‑performing, DDPG achieves the highest coverage and diversity, and TD3 performs poorly, indicating KuaiSim’s adaptability to different datasets.
4 Conclusion
KuaiSim is a versatile, multi‑level user simulator that establishes strong baselines for reinforcement‑learning‑based recommender systems, offers comprehensive evaluation protocols, and demonstrates effective data transfer across datasets, thereby advancing research in recommendation technologies.
Kuaishou Tech