
Reinforcement Learning for Cold‑Start Job Recommendation in 58.com

This talk explains how 58.com tackles the cold-start and interest-divergence problems on its massive blue-collar job recruitment platform by framing recommendation as a reinforcement-learning task. It covers multi-armed bandit, contextual bandit, and linear-UCB algorithms, the offline evaluation pipeline, online deployment, and the observed performance gains.

58 Tech

58.com operates the largest blue‑collar recruitment platform in China, handling tens of millions of job postings and a comparable number of job seekers daily, which creates a classic two‑sided matching problem with severe cold‑start and user‑interest‑divergence challenges.

The presentation first outlines the recruitment workflow, highlighting that job seekers express interest by clicking, applying, or contacting recruiters, while recruiters respond, ultimately leading to interviews and hires. Two main issues are identified: (1) cold‑start for new users, and (2) scattered user interests across many job categories.

Reinforcement learning (RL) is introduced as a trial‑and‑error framework that maximizes cumulative reward without explicit supervision. The four RL components—state, action, reward, and transition probability—are mapped to recommendation concepts: user features or context as state, recommendation decisions as actions, click/apply metrics as rewards, and the updated user profile after interaction as the new state.
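That mapping can be made concrete with a minimal sketch; all names here (`UserState`, the reward values, and so on) are illustrative assumptions, not 58.com's code:

```python
from dataclasses import dataclass, field

@dataclass
class UserState:
    """State: static user features plus an evolving interest profile."""
    location: str
    age: int
    interest: dict = field(default_factory=dict)  # job category -> affinity

def reward(event: str) -> float:
    """Reward: click/apply feedback mapped to a scalar (values are assumed)."""
    return {"click": 1.0, "apply": 3.0, "none": 0.0}[event]

def transition(state: UserState, action: str, event: str) -> UserState:
    """Transition: the interaction outcome yields an updated user profile."""
    new_interest = dict(state.interest)
    new_interest[action] = new_interest.get(action, 0.0) + reward(event)
    return UserState(state.location, state.age, new_interest)
```

Here the recommended job category plays the role of the action, and the post-interaction profile is the next state the policy conditions on.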

Several RL‑based algorithms are discussed. A simple multi‑armed bandit (MAB) approach treats each job category as an arm; variants such as epsilon‑greedy, Upper‑Confidence‑Bound (UCB), and Thompson sampling are described. To incorporate contextual information (e.g., location, gender, age), the contextual MAB (CMAB) model is adopted, allowing the policy to condition on user attributes.
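The category-as-arm idea can be sketched with a generic bandit supporting epsilon-greedy and UCB1 selection; this is a textbook implementation under assumed defaults, not 58.com's production code:

```python
import math
import random

class CategoryBandit:
    """Multi-armed bandit over job categories: each category is one arm."""

    def __init__(self, categories, epsilon=0.1):
        self.categories = list(categories)
        self.epsilon = epsilon
        self.counts = {c: 0 for c in self.categories}   # pulls per arm
        self.values = {c: 0.0 for c in self.categories} # mean reward per arm
        self.t = 0                                      # total selection rounds

    def select_epsilon_greedy(self):
        if random.random() < self.epsilon:
            return random.choice(self.categories)       # explore
        return max(self.categories, key=lambda c: self.values[c])  # exploit

    def select_ucb(self):
        self.t += 1
        def score(c):
            if self.counts[c] == 0:
                return float("inf")  # force one pull of every arm first
            bonus = math.sqrt(2 * math.log(self.t) / self.counts[c])
            return self.values[c] + bonus
        return max(self.categories, key=score)

    def update(self, category, reward):
        """Incremental mean update after observing a click/apply reward."""
        self.counts[category] += 1
        self.values[category] += (reward - self.values[category]) / self.counts[category]
```

Thompson sampling would replace the scores above with draws from a per-arm posterior (e.g. Beta for binary click rewards), but the update loop is the same.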

Because the action space (thousands of job categories) is too large for direct RL, the system groups jobs into buckets based on user attributes and combines epsilon-greedy exploration with linear-UCB (LinUCB) scoring to balance exploration and exploitation. The talk details bucket construction, reward aggregation across multiple recall channels, and state-vector updates (vectors are initialized uniformly).
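A standard LinUCB selection/update loop over such buckets might look like the following; this is an illustrative sketch of the textbook algorithm, and the bucket names, feature dimension, and `alpha` are assumptions:

```python
import numpy as np

class LinUCB:
    """LinUCB over job buckets: one linear reward model per bucket (arm),
    scored on a user-context vector x (e.g. encoded location, gender, age)."""

    def __init__(self, arms, dim, alpha=1.0):
        self.alpha = alpha                          # exploration strength
        self.A = {a: np.eye(dim) for a in arms}     # ridge Gram matrix per arm
        self.b = {a: np.zeros(dim) for a in arms}   # reward-weighted context sum

    def select(self, x):
        def ucb(a):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]               # per-arm linear weights
            # mean estimate + confidence width around it
            return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(self.A, key=ucb)

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

The confidence term shrinks as a bucket accumulates observations for similar contexts, so exploration naturally concentrates on under-observed buckets.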

An offline simulation framework is built using real user sequences to emulate cold‑start to informed‑user transitions, enabling rapid iteration on bucket design, algorithm choice, and hyper‑parameters before online rollout.
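The general replay idea behind such a simulator can be sketched as follows; the talk does not give implementation details, so the function and policy interface here are assumptions. Only logged events whose recorded action matches the policy's current choice contribute to the estimate:

```python
def replay_evaluate(policy, log):
    """Offline replay evaluation on logged user sequences.

    log: iterable of (context, logged_action, reward) tuples.
    Returns the average reward over the events the policy 'agrees' with.
    """
    matched, total_reward = 0, 0.0
    for context, logged_action, reward in log:
        if policy.select(context) == logged_action:
            matched += 1
            total_reward += reward
            policy.update(logged_action, context, reward)  # learn as it replays
    return total_reward / matched if matched else 0.0
```

Because only matching events count, different bucket designs and hyper-parameters can be compared on the same logs before anything is deployed online.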

Online deployment leverages Redis for bucket and user‑state storage, Spark Streaming for real‑time behavior logging, and a lazy‑update mechanism that refreshes user vectors only when the user enters the recommendation flow. This architecture keeps latency low (sub‑second) and ensures high update coverage.
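The lazy-update pattern can be sketched with a plain dict standing in for Redis; the key layout and function names are assumptions, and in production the event log would be written by the Spark Streaming job:

```python
store = {}  # stand-in for Redis: key -> value

def log_event(user_id, category, reward):
    """Streaming side: append raw feedback only, no vector math on the hot path."""
    store.setdefault(f"events:{user_id}", []).append((category, reward))

def get_state(user_id):
    """Recommendation side: fold pending events into the user vector lazily,
    i.e. only when the user actually re-enters the recommendation flow."""
    vector = store.get(f"state:{user_id}", {})
    for category, reward in store.pop(f"events:{user_id}", []):
        vector[category] = vector.get(category, 0.0) + reward
    store[f"state:{user_id}"] = vector
    return vector
```

Deferring the vector refresh to request time is what keeps write amplification low while still guaranteeing that every served recommendation sees a fully up-to-date state.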

Experimental results show that the RL‑driven cold‑start solution improves new‑user click‑through rate and, more importantly, increases the first‑time application conversion by about 8% compared with baseline supervised models.

Future directions include extending the RL‑based recall to all users, exploring multi‑objective optimization, and investigating deeper RL architectures such as RNN‑based policies.

Tags: Reinforcement Learning, Cold Start, Multi-armed Bandit, Contextual Bandit, Job Recommendation, Online Evaluation
Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
