
KuaiSim: A Comprehensive User Simulator for Reinforcement Learning in Recommendation Systems

KuaiSim is a comprehensive user simulation environment for recommender systems. It models immediate, long‑term, and cross‑session feedback; supports list‑wise, whole‑session, and retention tasks; provides baselines and evaluation metrics; and demonstrates superior performance on the KuaiRand and ML‑1M datasets.

Kuaishou Tech

Abstract

Reinforcement‑learning (RL) based recommender systems attract attention for learning optimal policies that maximize long‑term user reward, but deploying RL models online and collecting real A/B data is costly. Simulators offer an offline alternative, yet existing ones suffer from limited feedback types, inconsistency with real data, evaluation challenges, and poor transfer across recommender systems. KuaiSim is a comprehensive user simulator that generates multi‑behavior and cross‑session signals. It supports three task levels: list‑wise recommendation, whole‑session sequential recommendation, and cross‑session retention optimization, and it provides evaluation methods and baseline algorithms for future benchmarking.

1 Introduction

Deploying RL models directly in live environments is resource‑intensive; simulators enable offline training and evaluation of recommender models. Existing simulators lack long‑term feedback modeling, exhibit data distribution gaps with real logs, and provide limited evaluation tools.

2 KuaiSim Workflow

Figure 1 (right) shows a generic MDP in which a recommender receives a user request and produces an action: a list of items. Each request combines static user‑profile features with a dynamic interaction history. KuaiSim comprises three modules: an instant feedback model, a leave‑session model, and a retention model, which together produce three feedback types: instant feedback, a leave signal, and a return‑time signal.
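The interaction loop above can be sketched as follows. `ToyKuaiSimEnv`, `UserRequest`, and `run_session` are hypothetical names for illustration, not KuaiSim's actual API; the toy environment only mimics the shape of the loop (request in, item list out, instant reward plus leave signal back):

```python
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class UserRequest:
    """A request combines static profile features with dynamic history."""
    profile: dict
    history: List[int] = field(default_factory=list)

class ToyKuaiSimEnv:
    """Minimal stand-in for the simulator: each step returns an instant
    reward and a leave signal once simulated patience is exhausted."""
    def __init__(self, patience=3, seed=0):
        self.init_patience = patience
        self.rng = random.Random(seed)

    def reset(self):
        self.patience = self.init_patience
        return UserRequest(profile={"user_id": 0})

    def step(self, item_list):
        # Toy instant reward: mean of random per-item responses.
        reward = sum(self.rng.random() for _ in item_list) / len(item_list)
        self.patience -= 1 if reward < 0.5 else 0  # low reward drains patience
        leave = self.patience <= 0
        request = UserRequest(profile={"user_id": 0}, history=list(item_list))
        return request, reward, leave, {}

def run_session(env, policy, max_steps=50):
    """One simulated session of the generic MDP in Figure 1."""
    request = env.reset()
    total_reward, depth = 0.0, 0
    for _ in range(max_steps):
        item_list = policy(request)          # action: a list of items
        request, reward, leave, _ = env.step(item_list)
        total_reward += reward
        depth += 1
        if leave:                            # leave-session signal fired
            break
    return total_reward, depth
```

The session depth returned here is the same quantity the benchmark later reports as "Depth" for whole‑session tasks.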

2.1 Instant Feedback Module

The Instant Feedback Module (UIRM) infers the latent user state and outputs probabilities for each instant feedback type. It uses an item_correlation function to penalize highly similar items, encouraging diversity. The module is pretrained on log data using binary cross‑entropy for each feedback type.
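A minimal sketch of the two ideas in this module: a pairwise‑similarity penalty for near‑duplicate items, and per‑feedback‑type binary cross‑entropy for pretraining. This is an illustration of the concepts, not the paper's exact `item_correlation` formulation or network:

```python
import math

def item_correlation(item_embs):
    """Mean pairwise cosine similarity within a recommended list;
    subtracting it from the response score penalizes highly similar
    items and thereby encourages diversity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    n = len(item_embs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos(item_embs[i], item_embs[j]) for i, j in pairs) / len(pairs)

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one feedback type; pretraining sums
    this loss over every instant feedback type (click, like, ...)."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

A list of identical embeddings scores a correlation of 1.0 (maximum penalty), while mutually orthogonal items score 0.0.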

2.2 Leave Module

The Leave Module models user patience: patience decreases with each interaction, and a leave signal is triggered when it falls below a threshold. The instant reward derived from the UIRM modulates the patience decay, and the patience‑related hyper‑parameters are configurable.
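The patience mechanism can be sketched as below. The parameter names (`init_patience`, `decay`, `threshold`) and the reward‑scaled decay rule are assumptions for illustration; the paper exposes similar patience‑related hyper‑parameters but may define the decay differently:

```python
class LeaveModule:
    """User-patience model: patience starts at an initial budget and
    decays each interaction; a leave signal fires once it drops to
    the threshold. Higher instant reward slows the decay."""
    def __init__(self, init_patience=1.0, decay=0.2, threshold=0.0):
        self.init_patience = init_patience
        self.decay = decay            # base patience lost per step
        self.threshold = threshold    # leave once patience <= threshold
        self.patience = init_patience

    def reset(self):
        self.patience = self.init_patience

    def step(self, instant_reward):
        # Good recommendations (reward near 1) offset the decay.
        self.patience -= self.decay * (1.0 - instant_reward)
        return self.patience <= self.threshold  # True => leave signal
```

Under this rule a perfectly satisfied user (reward 1.0) never leaves, while a stream of zero‑reward lists exhausts patience in `init_patience / decay` steps.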

2.3 Retention Module

The Retention Module predicts the time until the user returns, modeled as a geometric distribution. It combines global, personal, and feedback‑dependent retention biases, reflecting activity levels and the effect of recommendation quality on return probability.
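A sketch of the return‑time model: the three biases are combined into a per‑day return probability, and the return day then follows a geometric distribution. The sigmoid combination and the bias names are illustrative assumptions, not the paper's fitted parameterization:

```python
import math
import random

def return_probability(global_bias, personal_bias, feedback_bias):
    """Per-day probability that the user returns, combining a global
    bias, a per-user activity bias, and a feedback-dependent bias
    (better recommendations raise the return probability)."""
    logit = global_bias + personal_bias + feedback_bias
    return 1.0 / (1.0 + math.exp(-logit))

def sample_return_day(p, rng=random, max_days=30):
    """Return time is geometric: each day the user independently
    returns with probability p; expected return time is 1 / p."""
    for day in range(1, max_days + 1):
        if rng.random() < p:
            return day
    return max_days  # truncate very long absences
```

With all biases at zero the daily return probability is 0.5, giving an expected return time of two days.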

3 Benchmark Results and Analysis

3.1 Experimental Setup

Datasets: KuaiRand and ML‑1M. Evaluation: task‑specific metrics for list‑wise, whole‑session, and cross‑session tasks.

List‑wise recommendation (request level): L‑reward (average instant reward), Coverage, and Intra‑list diversity (ILD).

Sequential recommendation (whole session): whole‑session reward, average reward, and Depth (number of interactions before leaving).

Retention optimization (cross‑session): return time and user retention rate.
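Two of the list‑wise metrics can be computed straightforwardly; the sketch below assumes Coverage is the fraction of the catalog ever recommended and ILD is the mean pairwise distance (here 1 minus cosine similarity) within a list. The exact distance used in the benchmark may differ:

```python
import math

def coverage(recommended_lists, catalog_size):
    """Fraction of the item catalog appearing in at least one
    recommended list across the evaluation run."""
    seen = set()
    for lst in recommended_lists:
        seen.update(lst)
    return len(seen) / catalog_size

def intra_list_diversity(item_embs):
    """ILD: mean pairwise distance between items in one list,
    using 1 - cosine similarity as the distance."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    n = len(item_embs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1 - cos(item_embs[i], item_embs[j]) for i, j in pairs) / len(pairs)
```

A list of orthogonal item embeddings attains the maximum ILD of 1.0, while duplicated items score 0.0.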

3.2 Benchmark Results

For list‑wise recommendation, ListCVAE achieves the highest reward and diversity, while PRM performs worst. For whole‑session sequential recommendation, the HAC framework consistently outperforms others, with A2C being the least stable. For retention optimization, RLUR surpasses TD3 and CEM, showing the potential of advanced RL methods.

3.3 Comparison with Existing Simulators

Qualitative analysis shows existing simulators ignore long‑term feedback and support only a single task, whereas KuaiSim satisfies all requirements. Quantitative analysis on KuaiRand demonstrates KuaiSim’s superior fidelity (higher AUC for click prediction) and better agent training performance across depth, average reward, and total reward metrics.

3.4 Data Transferability

KuaiSim was also instantiated on the ML‑1M dataset; benchmark results confirm that HAC remains top‑performing, DDPG achieves the highest coverage and diversity, and TD3 performs poorly, indicating KuaiSim’s adaptability to different datasets.

4 Conclusion

KuaiSim is a versatile, multi‑level user simulator that establishes strong baselines for reinforcement‑learning‑based recommender systems, offers comprehensive evaluation protocols, and demonstrates effective data transfer across datasets, thereby advancing research in recommendation technologies.

Tags: Benchmark, reinforcement learning, recommender systems, KuaiSim, User Simulation
Written by Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.