
Reinforcement Learning for Recommendation System Mixing: Concepts, Practice, and Evaluation

This article explains how reinforcement learning, with its focus on maximizing long‑term reward, can improve recommendation system mixing. It covers basic RL concepts, how RL differs from supervised learning, multi‑armed bandit approaches, a hands‑on OpenAI Gym experiment, a new AUC‑style offline metric, online gains, and further model optimizations.

DataFunTalk

Compared with traditional supervised learning, reinforcement learning (RL) can maximize long‑term reward, which is especially valuable for recommendation systems that need to look beyond immediate clicks.

The article introduces RL basics, including the classic ⟨A, S, R, P⟩ tuple (Agent, State, Reward, and state‑transition Probability, i.e., the environment model), and contrasts RL with supervised and unsupervised learning, highlighting its focus on long‑term gains.

It explains the multi‑armed bandit (MAB) problem as a core RL technique for exploration vs. exploitation, and mentions AlphaGo as an example that combines policy‑based and value‑based networks.
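The exploration‑vs‑exploitation trade‑off at the heart of the MAB problem can be made concrete with a minimal epsilon‑greedy simulation. This is an illustrative sketch, not code from the talk: the Bernoulli arm probabilities and the `run_epsilon_greedy` helper are assumptions chosen for demonstration.

```python
import random

def run_epsilon_greedy(true_means, eps=0.1, steps=10000, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms      # pulls per arm
    values = [0.0] * n_arms    # running estimate of each arm's mean reward
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # incremental mean update avoids storing the full reward history
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return values, total_reward

est, total = run_epsilon_greedy([0.2, 0.5, 0.8])
print(est)
```

With enough pulls the estimates converge toward the true arm means, and the agent spends most of its budget on the best arm while still sampling the others often enough to correct early mistakes.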

For hands‑on practice, the article suggests using OpenAI’s gym environment (CartPole) and provides a complete Q‑learning implementation:

```python
# Python 3 port of the snippet (written against the classic Gym API,
# where reset() returns obs and step() returns a 4-tuple).
import gym
import random
import numpy

N_BINS = [5, 5, 5, 5]
LEARNING_RATE = 0.05
DISCOUNT_FACTOR = 0.9
EPS = 0.3
MIN_VALUES = [-0.5, -2.0, -0.5, -3.0]
MAX_VALUES = [0.5, 2.0, 0.5, 3.0]
BINS = [numpy.linspace(MIN_VALUES[i], MAX_VALUES[i], N_BINS[i]) for i in range(4)]

def discretize(obs):
    # Map the continuous 4-dimensional observation to a discrete state tuple
    return tuple(int(numpy.digitize(obs[i], BINS[i])) for i in range(4))

qv = {}  # Q-table: (state, action) -> value

env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)
an = env.action_space.n

def get(s, a):
    # Unseen (state, action) pairs default to a Q-value of 0
    return qv.get((s, a), 0)

def update(s, a, s1, r):
    # One-step Q-learning: Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
    nows = get(s, a)
    best_next = max(get(s1, 0), get(s1, 1))
    qv[(s, a)] = nows + LEARNING_RATE * (r + DISCOUNT_FACTOR * best_next - nows)

# Training loop with epsilon-greedy exploration
for i in range(500000):
    obs = env.reset()
    if i % 1000 == 0:
        print(i)
    for _ in range(5000):
        s = discretize(obs)
        nowa = 0 if get(s, 0) >= get(s, 1) else 1
        if random.random() <= EPS:
            nowa = 1 - nowa  # explore: flip to the other action
        obs, reward, done, info = env.step(nowa)
        s1 = discretize(obs)
        if done:
            reward = -10  # penalize the pole falling or the cart leaving the track
        update(s, nowa, s1, reward)
        if done:
            break

# Greedy evaluation episode using the learned Q-table
for i_episode in range(1):
    obs = env.reset()
    for t in range(5000):
        env.render()
        s = discretize(obs)
        maxa = 0 if get(s, 0) >= get(s, 1) else 1
        obs, reward, done, info = env.step(maxa)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break
```

The article then discusses why RL is needed for recommendation mixing, pointing out challenges such as heterogeneous data, differing objectives across content types, high computational cost, and varying content quality.

It models the recommendation process as a Markov Decision Process where the system is the agent, recommended items are actions, and user feedback (clicks, negative feedback, exits) serves as reward.
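Under this MDP framing, each user session becomes an episode whose return is the discounted sum of feedback rewards. The sketch below illustrates the idea only: the event names and reward values in `FEEDBACK_REWARDS` are assumptions for demonstration, not the production settings described in the talk.

```python
# Illustrative reward shaping for user feedback; the events and magnitudes
# here are hypothetical, not the article's actual configuration.
FEEDBACK_REWARDS = {
    "click": 1.0,
    "dwell_per_10s": 0.5,       # reward proportional to dwell time
    "negative_feedback": -2.0,  # explicit dislikes are strongly penalized
    "exit": -1.0,
}

def episode_return(events, gamma=0.9):
    """Discounted cumulative reward for one user session (list of event names)."""
    return sum((gamma ** t) * FEEDBACK_REWARDS.get(e, 0.0)
               for t, e in enumerate(events))

print(episode_return(["click", "click", "exit"]))  # 1 + 0.9 - 0.81 ≈ 1.09
```

The discount factor is what lets the policy trade an immediate click against keeping the user in the session, which is exactly the long‑term objective the article argues supervised CTR models miss.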

To evaluate models offline, a new AUC metric is proposed that measures the probability that a pair of items with higher cumulative reward is ranked higher, arguing that this better reflects long‑term gains than traditional CTR‑based AUC.
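A minimal version of this pairwise metric can be computed directly from model scores and observed cumulative rewards. The function below is a sketch of the stated definition, assuming per‑item cumulative rewards are available; tie handling (half credit for equal scores) is a common convention, not something the article specifies.

```python
from itertools import combinations

def long_term_auc(scores, cum_rewards):
    """Probability that, for a random pair with different cumulative rewards,
    the item with the higher cumulative reward also receives the higher
    model score. Score ties count as half-concordant."""
    concordant, total = 0.0, 0
    for i, j in combinations(range(len(scores)), 2):
        if cum_rewards[i] == cum_rewards[j]:
            continue  # pairs with equal reward carry no ranking information
        total += 1
        hi, lo = (i, j) if cum_rewards[i] > cum_rewards[j] else (j, i)
        if scores[hi] > scores[lo]:
            concordant += 1.0
        elif scores[hi] == scores[lo]:
            concordant += 0.5
    return concordant / total if total else float("nan")

# A perfectly reward-aligned ranking scores 1.0; a fully reversed one scores 0.0.
print(long_term_auc([0.9, 0.5, 0.1], [3.0, 2.0, 1.0]))  # 1.0
```

Replacing click labels with cumulative rewards in the pair construction is the whole difference from standard CTR AUC, which is why the metric rewards models that rank for long‑term value.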

Online experiments show a 7% increase in total dwell time compared with rule‑based mixing, and a 1‑2% improvement over a supervised learning baseline.

Further model optimizations are described, including session‑based recommendation using a personalized DQN with RNN‑encoded states, Bloom embeddings with a Dueling DQN to reduce hash collisions, and Dueling Double DQN (DDDQN) for more stable learning.

Negative feedback is incorporated as a negative reward, and focal loss is applied to address its sparsity, achieving a 19% reduction in negative feedback rate.
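Focal loss down‑weights well‑classified examples so that the sparse negative‑feedback signal is not drowned out by easy ones. Below is a standard binary focal loss sketch (following the usual formulation from the object‑detection literature); the `gamma` and `alpha` values are the common defaults, not the settings used in the article.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - pt)^gamma so that
    confident, correct predictions contribute almost nothing.
    p: predicted probabilities, y: 0/1 labels."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)         # probability assigned to the true class
    w = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -np.mean(w * (1 - pt) ** gamma * np.log(pt))

# An easy, correct prediction is nearly free; a confident mistake is expensive.
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.05]), np.array([1]))
print(easy, hard)
```

This skew toward hard examples is what makes the loss suitable for a reward signal as sparse as explicit negative feedback.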

The article concludes with reflections on the similarity between RL’s actor‑critic architecture and GANs, suggesting potential fusion of the two approaches for future improvements.

Finally, the author thanks the audience and invites readers to like, share, and join the DataFunTalk community for further AI and big‑data discussions.

Tags: Artificial Intelligence, recommendation systems, reinforcement learning, multi-armed bandit, long-term reward, OpenAI Gym, Q-learning
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
