
Applying Reinforcement Learning to Solve Cold‑Start Problems in 58.com Job Recruitment

This talk explains how 58.com’s massive blue‑collar recruitment platform uses reinforcement‑learning techniques—including multi‑armed bandits, contextual MAB, and linear UCB—to address cold‑start and interest‑divergence challenges, describes the system architecture, offline evaluation, online deployment, and reports an 8% uplift in new‑user conversion.

DataFunTalk

58.com’s recruitment service is the largest blue‑collar job platform in China, handling millions of daily postings and user interactions, which creates a classic two‑sided matching problem with severe cold‑start and user‑interest‑divergence issues.

The presentation first outlines the recruitment workflow, then introduces reinforcement learning (RL) as a trial‑and‑error approach that maximizes cumulative reward without explicit supervision, and highlights how it differs from supervised learning, most notably in its delayed, sequential reward signal.

RL elements are mapped to recommendation: state = user features or behavior sequence, action = recommendation decision (job category, specific job, ranking, etc.), transition = user’s new state after interaction, and reward = metrics like click‑through rate, conversion, or dwell time.
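This mapping can be sketched as simple data structures. The field names below are illustrative, not 58.com's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class State:
    user_features: dict           # e.g. {"region": "Beijing", "age": 28}
    behavior_sequence: List[str]  # recently viewed job IDs

@dataclass
class Transition:
    state: State        # user state before the recommendation
    action: str         # recommendation decision, e.g. a job ID or category
    reward: float       # e.g. 1.0 for a click or conversion, 0.0 otherwise
    next_state: State   # user state after the interaction

# One logged interaction: the user clicked the recommended job,
# which is appended to their behavior sequence.
log = Transition(
    state=State({"region": "Beijing", "age": 28}, ["job_12"]),
    action="job_34",
    reward=1.0,
    next_state=State({"region": "Beijing", "age": 28}, ["job_12", "job_34"]),
)
print(log.reward)
```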

Several RL‑based algorithms are discussed. The simplest is the Multi‑Armed Bandit (MAB) framework, with concrete algorithms such as epsilon‑greedy, Upper Confidence Bound (UCB), and Thompson sampling. Their limitations—lack of context and slow adaptation—lead to the contextual MAB (CMAB) approach, which incorporates user attributes (region, age, gender, salary expectations) into the decision process.
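Of the algorithms named above, UCB illustrates the exploration‑exploitation trade‑off most directly. The following is a generic UCB1 sketch (not 58.com's production code), where arms could stand for job categories and reward for a click:

```python
import math
import random

class UCB1Bandit:
    """Standard UCB1 multi-armed bandit: mean reward plus a confidence bonus."""

    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self, t: int) -> int:
        # Pull each arm once before applying the confidence bound.
        for arm, c in enumerate(self.counts):
            if c == 0:
                return arm
        # UCB1 score: exploit the mean, explore under-pulled arms.
        return max(
            range(len(self.counts)),
            key=lambda a: self.values[a]
            + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental update of the mean reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Toy simulation: arm 1 has the highest true click rate,
# so UCB1 should concentrate its pulls there.
random.seed(0)
true_ctr = [0.1, 0.4, 0.2]
bandit = UCB1Bandit(3)
for t in range(1, 2001):
    arm = bandit.select(t)
    bandit.update(arm, 1.0 if random.random() < true_ctr[arm] else 0.0)
print(bandit.counts)
```

Note that the selection rule uses no user features at all, which is exactly the limitation that motivates the contextual variants discussed next.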

Linear UCB (LinUCB) is introduced as a contextual bandit that assumes a linear relationship between context features and expected reward, enabling efficient exploration‑exploitation via a confidence‑bound term. This method has shown strong performance in cold‑start scenarios.
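A minimal disjoint‑model LinUCB sketch, assuming a per‑arm ridge‑regression estimate plus confidence bonus (the toy contexts and reward rule are invented for illustration):

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one linear reward model per arm, with a UCB bonus."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # ridge Gram matrix per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted feature sums

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge-regression estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence-bound term
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy setup: users with context [1, 0] convert on arm 0,
# users with context [0, 1] convert on arm 1.
rng = np.random.default_rng(0)
policy = LinUCB(n_arms=2, dim=2, alpha=0.5)
for _ in range(500):
    x = np.eye(2)[rng.integers(2)]
    arm = policy.select(x)
    reward = 1.0 if arm == int(np.argmax(x)) else 0.0
    policy.update(arm, x, reward)
print(policy.select(np.array([1.0, 0.0])))  # expected to pick arm 0
```

In the recruitment setting, the context vector x would carry the user attributes listed above (region, age, gender, salary expectations), letting a brand‑new user borrow statistics from similar users rather than starting from zero.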

For a full RL solution, the problem is modeled as a Markov Decision Process (MDP) to handle delayed rewards and large state/action spaces. Model‑free, off‑policy methods (e.g., experience replay, double networks) are employed to meet the platform’s real‑time constraints and business‑impact requirements.
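Experience replay, one of the off‑policy techniques mentioned, can be sketched as a fixed‑capacity buffer that stores transitions and serves random mini‑batches, decorrelating updates for a learner such as DQN (a generic sketch, not the platform's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: old transitions are evicted,
    and training samples are drawn uniformly at random."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off the left

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random mini-batch breaks the temporal correlation
        # between consecutive user interactions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

random.seed(0)
buf = ReplayBuffer(capacity=100)
for t in range(150):  # push more than capacity: the first 50 are evicted
    buf.push(t, t % 5, 1.0, t + 1, False)
batch = buf.sample(32)
print(len(buf), len(batch))
```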

The offline evaluation pipeline simulates user sequences using real interaction logs, matches recommended jobs to user profiles, and computes similarity scores for both structured resume data and behavior‑based embeddings, allowing rapid iteration of algorithmic hyper‑parameters before online rollout.
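The similarity scoring in such a replay pipeline is commonly cosine similarity between embedding vectors; a minimal sketch (the vectors and the metric choice here are illustrative assumptions, not 58.com's actual formula):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: how well a
    recommended job matches a user profile in offline replay."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

user_vec = [0.2, 0.8, 0.1]   # e.g. a behavior-based user embedding
job_vec = [0.25, 0.7, 0.05]  # embedding of a recommended job
score = cosine_similarity(user_vec, job_vec)
print(round(score, 3))
```

The same function can score structured resume fields once they are encoded as vectors, so one metric covers both the profile‑based and behavior‑based matches mentioned above.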

Online deployment uses Redis for daily‑updated job buckets and user vectors, Spark Streaming for real‑time behavior capture, and a logging system that records state, action, and reward for subsequent offline RL training.

Experimental results show that the RL‑driven cold‑start strategy improves new‑user click‑through and job‑submission conversion by roughly 8% in A/B tests, demonstrating the practical value of RL in large‑scale recommendation.

Future work includes extending the RL framework to all users (not just cold‑start), exploring multi‑objective parameter fusion, and investigating deeper sequential models such as RNN‑based policies.

Tags: recommendation system, reinforcement learning, cold start, online learning, multi-armed bandit, contextual MAB, job recruitment
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
