OneReason: When Recommendation Systems Learn to Reason

The OneReason report details how Kuaishou’s recommendation team injects reasoning into large‑scale recommender models through a four‑level pre‑training pipeline, chain‑of‑thought (CoT) fine‑tuning, and specialized reinforcement learning, achieving significant offline gains and a 10.33% exposure lift in a live A/B test.

Machine Heart

Background

Over the past decade, recommendation systems have advanced by scaling statistical co‑occurrence models, from collaborative filtering to the generative OneRec series, through more memory, more parameters, and longer sequences. In the LLM era, however, pure scaling runs into hard limits: cold‑start users, long‑tail items, cross‑domain transfer, and multi‑objective weighting.

Large language models (LLMs) have moved from scaling to reasoning and agentic capabilities (e.g., OpenAI o1, DeepSeek R1). The report argues that recommendation systems need a similar reasoning stage to unlock new growth.

Why Reasoning in Recommendation Is Not a Simple LLM Copy

Recommendation reasoning must address three intrinsic problems:

Cause‑effect ("why") inference: User actions are the effect; the model must infer the underlying intent (cause) from noisy, sparse, cross‑domain sequences.

Explainable, intervenable cognition: A reasoning base model makes the decision process explicit in a CoT trace, allowing business constraints to be written directly into the reasoning layer and shortening strategy iteration cycles.

Foundation for agentic recommender systems: Future agents that plan, call tools, and hold multi‑turn dialogs require a base model that understands item semantics, reasons reliably, and follows instructions.

How to Build Recommendation CoT?

The authors first study the foundations of reasoning in multimodal models and identify two prerequisites:

Deep semantic alignment between modality/token spaces.

A clear, hierarchical, coarse‑to‑fine reasoning chain.

In recommendation, neither prerequisite holds by default:

Itemic tokens lack deep semantic links to natural language; they are treated as discrete IDs.

Simply mixing generic reasoning data with the CoT format does not yield recommendation‑specific logical chains.

Recommendation reasoning differs from mathematical reasoning: it is abductive (inferring causes) rather than deductive, requiring the model to compress noisy behavior into plausible interest hypotheses before reaching a decision.

Four‑Level Reasoning Capabilities (R0‑R3)

R0 Perception: Understand each itemic token and map it to an interest point.

R1 Deduction: Learn item‑to‑item relations and the commonsense reasons behind them.

R2 Evolution: Model long‑term user interest evolution.

R3 Recommendation: Combine the above to produce high‑quality, cross‑domain recommendations.

OneReason Pre‑training Design

OneReason is pre‑trained on a 578 B‑token corpus organized into four hierarchical layers: Token, Item, Relational, and User. The three‑stage training schedule is:

Warm‑up (110 B tokens): Freeze the backbone, train new item embeddings to embed items into the LLM semantic space.

Full‑parameter training (449 B tokens): Jointly align all four layers.

Long‑sequence optimization (19 B tokens): Expand the context window to 32 K tokens for long user histories.

This design resolves the item‑text semantic gap that limited earlier OpenOneRec models.
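The three stages above can be written down as a small configuration, which also makes it easy to check that the stage budgets add up to the stated 578 B tokens. This is a minimal sketch, not the authors' training code: the `Stage` class, the parameter-group names, and the reading of "32 K" as 32 * 1024 are assumptions made for illustration, and the report summary does not say whether the backbone stays unfrozen in the long-sequence stage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    tokens_b: int               # token budget in billions, as stated in the report
    freeze_backbone: bool       # True only for the warm-up stage
    context_len: Optional[int]  # None where the summary gives no figure


SCHEDULE = [
    Stage("warm_up", 110, True, None),
    Stage("full_parameter", 449, False, None),
    # Assumption: "32 K" means 32 * 1024 tokens and the backbone stays unfrozen.
    Stage("long_sequence", 19, False, 32 * 1024),
]


def trainable_groups(stage: Stage) -> List[str]:
    """Return the (hypothetical) parameter groups updated in a stage."""
    if stage.freeze_backbone:
        return ["item_embeddings"]
    return ["item_embeddings", "backbone"]


def total_tokens_b(schedule: List[Stage]) -> int:
    """Sum the per-stage token budgets, in billions."""
    return sum(stage.tokens_b for stage in schedule)


if __name__ == "__main__":
    for stage in SCHEDULE:
        print(stage.name, stage.tokens_b, trainable_groups(stage))
    print("total (B tokens):", total_tokens_b(SCHEDULE))
```

The sum 110 + 449 + 19 = 578 matches the corpus size quoted for pre-training.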

Data Granularity

Token level: Single‑token meaning, prefix prediction, hierarchical inference.

Item level: Coarse filtering of redundant details, dual QA mapping between item content and text.

Relational level: Convert implicit collaborative signals into explicit textual transition chains (item → description → next item).

User level: Time‑ordered cross‑domain behavior streams, with random item‑to‑text substitution for full‑scene alignment.
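To make the relational and user levels more concrete, the sketch below builds a textual transition chain and a time-ordered behavior stream with random item-to-text substitution. It is an illustration under assumptions: the token strings, the caption dictionary, the substitution probability `p`, and both function names are hypothetical and do not come from the report.

```python
import random
from typing import Dict, List, Tuple


def relational_chain(item_token: str, description: str, next_item_token: str) -> str:
    """Spell out an implicit co-occurrence as item -> description -> next item."""
    return f"{item_token} -> {description} -> {next_item_token}"


def substitute_items(
    history: List[Tuple[int, str]],
    captions: Dict[str, str],
    p: float,
    rng: random.Random,
) -> List[str]:
    """Order events by timestamp and swap each item token for its text
    caption with probability p (hypothetical substitution rule)."""
    stream = []
    for _, token in sorted(history, key=lambda event: event[0]):
        if token in captions and rng.random() < p:
            stream.append(captions[token])
        else:
            stream.append(token)
    return stream


if __name__ == "__main__":
    history = [(20, "<item_b_7>"), (10, "<item_a_3>")]
    captions = {"<item_a_3>": "tactical shooter gameplay clip",
                "<item_b_7>": "gaming headset product page"}
    print(relational_chain("<item_a_3>", "viewer wants better audio cues", "<item_b_7>"))
    print(substitute_items(history, captions, p=0.5, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the substitution reproducible, which helps when comparing data mixes.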

SFT (Supervised Fine‑Tuning) Design

After pre‑training, the model can understand item semantics but still needs to learn recommendation‑specific reasoning. The SFT stage introduces a CoT format that encodes three modules:

Persona Abstraction: Define 20 user‑type personas (e.g., family‑oriented, live‑shopping enthusiast) and extract them from noisy histories.

Interest Expansion: Generate a compact set of candidate interests (n = 1, 3, 5) to avoid “over‑thinking”. Experiments show that a small n yields the best performance.

Transition Inference: Evaluate candidate directions using evidence strength, recency, persona match, domain compatibility, and safety, then produce the final recommendation.
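The three modules can be pictured as a structured trace. In the report the model carries out this evaluation in its CoT text; the sketch below only mirrors the five transition criteria with an explicit score so the trade-off is easier to see. The `Candidate` class, the equal weights, the safety floor, and the rendered wording are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The five criteria named for Transition Inference.
CRITERIA = ("evidence_strength", "recency", "persona_match",
            "domain_compatibility", "safety")


@dataclass(frozen=True)
class Candidate:
    interest: str
    scores: Dict[str, float]  # criterion -> value in [0, 1] (hypothetical scale)


def rank_candidates(
    candidates: List[Candidate],
    weights: Optional[Dict[str, float]] = None,
    safety_floor: float = 0.5,
) -> List[Candidate]:
    """Drop candidates below the safety floor, then sort by weighted score."""
    weights = weights or {name: 1.0 for name in CRITERIA}
    safe = [c for c in candidates if c.scores["safety"] >= safety_floor]
    return sorted(
        safe,
        key=lambda c: sum(weights[k] * c.scores[k] for k in CRITERIA),
        reverse=True,
    )


def render_trace(persona: str, candidates: List[Candidate]) -> str:
    """Render Persona -> Interest Expansion -> Transition Inference as text."""
    ranked = rank_candidates(candidates)
    lines = [
        f"Persona: {persona}",
        "Interest expansion: " + "; ".join(c.interest for c in candidates),
    ]
    if ranked:
        lines.append(f"Transition inference: pursue '{ranked[0].interest}'")
    return "\n".join(lines)
```

Keeping the candidate list short (the report tests n = 1, 3, 5) keeps the trace compact, which is the stated guard against over-thinking.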

The SFT data distribution (≈ 1.5 M samples) includes R0‑R3 objectives, itemic‑instruction samples, and generic instruction samples.

CoT Quality Evaluation

The authors define five dimensions to detect common failure modes:

Safety: No leakage of item IDs or titles.

Consistency: Alignment between the final recommendation and the intended goal.

Logic: Whether the chain truly abstracts user behavior rather than merely echoing it.

Factuality: Strict grounding in real user sequences.

Informativeness: Presence of concrete, insightful explanations.
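Two of these dimensions lend themselves to simple automatic checks. The sketch below flags Safety violations (item tokens or target titles appearing in the trace) and computes a crude echo ratio as a proxy for the Logic failure mode. It is not the authors' evaluator: the `<item_x_123>` token pattern, the function names, and the echo-ratio heuristic are hypothetical.

```python
import re
from typing import List

# Hypothetical itemic-token format; the real tokenizer format is not given here.
ITEM_TOKEN = re.compile(r"<item_[a-z]_\d+>")


def safety_violations(cot_text: str, target_titles: List[str]) -> List[str]:
    """Return item tokens and target titles that leak into the CoT trace."""
    hits = ITEM_TOKEN.findall(cot_text)
    lowered = cot_text.lower()
    hits += [title for title in target_titles if title.lower() in lowered]
    return hits


def echo_ratio(cot_text: str, history_captions: List[str]) -> float:
    """Share of history captions copied verbatim into the trace.
    A high value suggests the chain echoes behavior instead of abstracting it."""
    if not history_captions:
        return 0.0
    lowered = cot_text.lower()
    copied = sum(1 for caption in history_captions if caption.lower() in lowered)
    return copied / len(history_captions)
```

A trace with no violations and a low echo ratio would still need a separate judgment for Consistency, Factuality, and Informativeness, which string matching cannot capture.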

Reinforcement Learning (RL) Design

Because SFT merely imitates existing data, RL is introduced to let the model explore better reasoning paths.

Two‑stage trajectory generation: First generate a reasoning trace, then expand multiple candidate recommendations from the same trace, increasing effective reward density.

Set‑wise reward: Evaluate a list of candidates jointly for coverage and diversity, encouraging multi‑interest exploration.

Stabilized training: Different clipping ranges for reasoning tokens vs. itemic tokens and down‑weighting non‑hit samples reduce gradient noise.
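The set-wise reward and the stabilization tricks can be sketched in a few functions. This assumes a PPO-style clipped surrogate, which the summary does not confirm; the reward weights, the clipping ranges (0.2 for reasoning tokens, 0.1 for itemic tokens), the non-hit weight, and all function names are placeholder assumptions rather than values from the report.

```python
from typing import Dict, List


def setwise_reward(
    candidates: List[str],
    target: str,
    interest_of: Dict[str, str],
    hit_weight: float = 1.0,
    diversity_weight: float = 0.2,
) -> float:
    """Score a candidate list jointly: did it cover the target, and how many
    distinct interests does it span relative to its length?"""
    if not candidates:
        return 0.0
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    hit = 1.0 if target in unique else 0.0
    interests = {interest_of[c] for c in unique if c in interest_of}
    diversity = len(interests) / len(candidates)
    return hit_weight * hit + diversity_weight * diversity


def clipped_term(
    ratio: float,
    advantage: float,
    is_item_token: bool,
    eps_reason: float = 0.2,
    eps_item: float = 0.1,
) -> float:
    """Per-token clipped surrogate with a token-type-specific clip range."""
    eps = eps_item if is_item_token else eps_reason
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def sample_weight(hit: bool, non_hit_weight: float = 0.5) -> float:
    """Down-weight trajectories whose candidate set missed the target."""
    return 1.0 if hit else non_hit_weight
```

Because several candidates are expanded from one reasoning trace, a single trace receives one joint reward over the whole list, which is how the two-stage generation raises reward density.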

Domain specialization is handled by a “Specialize‑then‑Unify” pipeline: train separate experts per domain (video, e‑commerce, ads, live) and then merge them via either Rejection‑Sampling Fine‑Tuning (RFT) or Multi‑Teacher On‑Policy Distillation (MOPD). RFT preserves high‑quality expert traces; MOPD yields broader knowledge transfer.
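The RFT branch of the merge step can be illustrated as a filter over expert-generated traces. This is a minimal sketch under assumptions: the acceptance rule (keep a trace only if its recommendations contain the ground-truth target), the per-prompt cap, and the `Trace` fields are hypothetical, since the summary does not specify the rejection criterion.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Trace:
    domain: str                   # e.g. "video", "e-commerce", "ads", "live"
    prompt: str
    reasoning: str
    recommended: Tuple[str, ...]
    target: str                   # ground-truth next item


def rejection_sample(traces: List[Trace], max_per_prompt: int = 1) -> List[Trace]:
    """Keep expert traces that hit the target, capped per (domain, prompt)."""
    kept: List[Trace] = []
    seen: Dict[Tuple[str, str], int] = {}
    for trace in traces:
        if trace.target not in trace.recommended:
            continue  # reject: the expert's reasoning did not lead to a hit
        key = (trace.domain, trace.prompt)
        if seen.get(key, 0) >= max_per_prompt:
            continue  # enough accepted traces for this prompt already
        seen[key] = seen.get(key, 0) + 1
        kept.append(trace)
    return kept
```

The surviving traces from all domain experts would then form the fine-tuning set for the unified model.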

Benchmark and Evaluation

OneReason‑Bench decomposes recommendation ability into four progressive levels (R0‑R3) and provides tasks such as item understanding, item‑to‑item QA, interest chain extraction, and final recommendation. The benchmark covers short video, product, ad, and live‑stream domains.

Experimental Results

Key findings:

Thinking (CoT) models outperform non‑thinking baselines across all four domains after RL; for example, short‑video Pass@4 improves by more than 60% over the best LC‑Rec baseline.

SFT‑only CoT often harms performance (over‑thinking), confirming the need for RL to unlock reasoning benefits.

Four‑level pre‑training raises R0 item perception by 160.5% and R3 cross‑domain recommendation by 65.1%.

Using OneReason weights, an ID‑based model’s ad‑domain hit rate increases nearly 5×.

Mixing CoT data with non‑CoT data also improves non‑thinking performance; the optimal mixing ratio differs per domain.

Likelihood analysis shows that before RL the CoT prefix reduces the likelihood of the target item, while after RL its effect becomes positive across all domains, indicating that high‑quality trajectories are essential.
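Two of the quantities in these findings are easy to state in code. The sketch assumes Pass@k means "the target item appears among the first k generated candidates", and that the likelihood analysis compares the target's log-likelihood with and without the CoT prefix; the report's exact definitions may differ, and both function names are illustrative.

```python
from typing import List, Sequence


def pass_at_k(predictions: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Fraction of samples whose target is among the first k candidates."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for preds, target in zip(predictions, targets) if target in preds[:k])
    return hits / len(targets)


def cot_likelihood_gain(logp_with_cot: List[float], logp_without_cot: List[float]) -> float:
    """Mean change in target log-likelihood caused by prepending the CoT.
    Negative: the prefix hurts (reported before RL). Positive: it helps."""
    if len(logp_with_cot) != len(logp_without_cot):
        raise ValueError("inputs must have the same length")
    if not logp_with_cot:
        return 0.0
    diffs = [w - wo for w, wo in zip(logp_with_cot, logp_without_cot)]
    return sum(diffs) / len(diffs)
```

For example, `pass_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=2)` counts one hit out of two samples.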

Case Study

A real‑world case targets a “Delta Force” equipment video. The user’s history contains only a single weak ad click for the game, with dominant interactions on other shooters. SFT predicts the next item as another popular shooter (e.g., “Peace Elite”), whereas the RL‑enhanced model infers the deeper intent—interest in new tactical‑shooter mechanics—and correctly recommends the Delta Force video, demonstrating multi‑hop reasoning.

Business Impact

In a 10‑day online A/B experiment on Kuaishou’s local‑life ad feed, OneReason achieved:

+10.33% exposure

+8.23% ad revenue

ROI > 5

The deployment uses a Fast‑Slow Thinking architecture: a slow‑thinking OneReason service provides high‑quality recall, while a fast‑thinking OneRec service handles real‑time scoring. Combined, the system yields a multi‑billion‑RMB annual revenue lift.
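One way to picture the Fast‑Slow split is a slow path that refreshes a per-user recall set in the background and a fast path that only scores what is already cached. The sketch below assumes this caching arrangement, which the summary does not describe; `SlowRecall`, `fast_rank`, and the stub functions stand in for the OneReason and OneRec services and are purely hypothetical.

```python
from typing import Callable, Dict, Iterable, List


class SlowRecall:
    """Slow path: run the reasoning model off the request path and cache
    its recalled candidates per user."""

    def __init__(self, reason_fn: Callable[[List[str]], Iterable[str]]):
        self._reason_fn = reason_fn
        self._cache: Dict[str, List[str]] = {}

    def refresh(self, user_id: str, history: List[str]) -> None:
        # Expensive call; assumed to run asynchronously, not per request.
        self._cache[user_id] = list(self._reason_fn(history))

    def candidates(self, user_id: str) -> List[str]:
        return self._cache.get(user_id, [])


def fast_rank(candidates: List[str], score_fn: Callable[[str], float], top_n: int) -> List[str]:
    """Fast path: score cached candidates in real time and keep the top n."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]


if __name__ == "__main__":
    # Stub functions standing in for the two services.
    recall = SlowRecall(lambda history: [f"{h}_related" for h in history])
    recall.refresh("u1", ["cafe", "gym"])
    print(fast_rank(recall.candidates("u1"), score_fn=len, top_n=1))
```

The design choice is that request latency depends only on the fast scorer, while the quality of the candidate pool comes from the slower reasoning model.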

Conclusion and Outlook

The report answers three core questions:

Can recommender bases reason? Yes—provided itemic‑token perception is aligned and a proper CoT format is used.

What should recommendation CoT look like? A three‑stage chain of Persona → Interest Expansion → Transition Inference.

Can reasoning models be deployed at scale? Yes: Fast‑Slow Thinking demonstrates industrial feasibility with strong ROI.

Future work will explore agentic recommender harnesses that plan and invoke tools, moving toward fully agentic recommendation systems. The authors plan to open‑source the OneReason model weights and further technical details.
