OneSearch‑V2 Launches: Self‑Distilled Generative Search That Truly Understands Users

OneSearch‑V2 introduces a latent‑reasoning enhanced self‑distillation framework that augments query understanding with thought‑augmented CoT and aligns preferences through direct user‑behavior feedback. It achieves a CTR lift of up to 4% and an order‑volume gain of about 2% without adding inference cost or latency.

Kuaishou Tech

Background

In large‑scale e‑commerce search, three major bottlenecks limit generative retrieval: (1) insufficient understanding of complex or ambiguous queries, (2) weak personalization of user intent, and (3) reward models that over‑fit narrow historical preferences.

OneSearch V1 Review

OneSearch V1 reduced inference cost and improved performance on mid‑frequency queries, but its end‑to‑end generation struggled with short ambiguous queries and long‑tail variations. Although it contributed ~8% of conversions, it accounted for one‑third of page views.

Limitations of V1

Gaps in understanding complex queries (e.g., a broad query such as “indoor fitness equipment” versus specific items).

Inadequate personalization; the model relied heavily on historical co‑occurrence.

Reward system prone to distribution shift and over‑fitting.

OneSearch V2 Design

Latent Reasoning‑Enhanced Self‑Distillation (LR‑ESD)

The LR‑ESD pipeline encodes keyword‑level chain‑of‑thought (CoT) reasoning into model weights. A teacher model receives the full query augmented with high‑density keywords, while a student receives the raw query. An information‑asymmetric KL loss forces the student to reproduce the teacher’s predictions, internalizing reasoning without extra parameters or inference tokens.
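The information‑asymmetric loss can be pictured as a per‑position KL divergence from the keyword‑conditioned (teacher) distribution to the raw‑query (student) distribution over SID tokens. The following is a minimal pure‑Python sketch, assuming the direction KL(teacher ‖ student), a plain mean over positions, and a hypothetical `temperature` knob; the paper's exact loss, weighting, and gradient handling may differ.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional (hypothetical) temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def self_distill_loss(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> float:
    """Mean per-position KL(teacher || student) over one SID token sequence.

    teacher_logits: one logit row per position, from the pass that sees query + CoT keywords.
    student_logits: one logit row per position, from the pass that sees only the raw query.
    In a real training loop the teacher side would be detached (stop-gradient).
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        total += kl_divergence(softmax(t_row, temperature), softmax(s_row, temperature))
    return total / len(teacher_logits)
```

The loss is zero when the raw‑query pass already reproduces the keyword‑conditioned predictions and grows as they diverge, which is what pushes the reasoning signal into the shared weights.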

Thought‑Augmented Query Understanding

A three‑step pipeline processes each query:

Intent, category, and attribute extraction.

Keyword extraction using an LLM‑generated, high‑density keyword‑based CoT.

Preference calibration with user‑profile signals.

This “thought‑augmented” stage avoids long‑text generation, preserving semantic richness while keeping latency low.
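A rough sketch of how the three steps could feed the two inputs used for self‑distillation: the teacher gets the raw query plus a short, de‑duplicated keyword string, and the student gets the raw query alone. The `QueryAnalysis` container, the `[KW]` separator token, `max_keywords`, and the profile‑based reordering used for step 3 are all hypothetical stand‑ins, not the paper's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class QueryAnalysis:
    """Hypothetical container for the outputs of steps 1 and 2."""
    query: str
    intent: str
    category: str
    attributes: List[str] = field(default_factory=list)
    cot_keywords: List[str] = field(default_factory=list)


def calibrate(keywords: List[str], preferred_terms: Sequence[str]) -> List[str]:
    """Hypothetical step 3: move keywords favoured by the user profile to the front.

    sorted() is stable, so all other keywords keep their original order.
    """
    preferred = set(preferred_terms)
    return sorted(keywords, key=lambda k: k not in preferred)


def build_inputs(
    a: QueryAnalysis,
    preferred_terms: Sequence[str],
    max_keywords: int = 8,
    sep: str = " | ",
) -> Tuple[str, str]:
    """Return (teacher_input, student_input); only the teacher sees the keywords."""
    seen, keywords = set(), []
    for kw in [a.intent, a.category, *a.attributes, *a.cot_keywords]:
        if kw and kw not in seen:  # keep the keyword string short and dense
            seen.add(kw)
            keywords.append(kw)
    keywords = calibrate(keywords, preferred_terms)[:max_keywords]
    return a.query + " [KW] " + sep.join(keywords), a.query


a = QueryAnalysis(
    query="indoor fitness equipment",
    intent="purchase",
    category="sports",
    attributes=["home use"],
    cot_keywords=["treadmill", "dumbbell", "yoga mat", "dumbbell"],
)
teacher, student = build_inputs(a, preferred_terms=["yoga mat"])
print(teacher)  # indoor fitness equipment [KW] yoga mat | purchase | sports | home use | treadmill | dumbbell
```

Capping the output at a handful of keywords, rather than generating free‑form reasoning text, mirrors the stated goal of keeping the augmented input compact.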

Behavior‑Feedback Preference Alignment

Instead of training a separate reward model, V2 incorporates real user actions (clicks, purchases) directly into a composite reward. Token‑Position Marginal Advantage GRPO (TPMA‑GRPO) distributes credit across the hierarchical semantic‑ID (SID) sequence, giving higher weight to the early, coarse‑grained tokens and aligning generation with downstream conversion signals.
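One way to picture this credit assignment is to combine a behavior‑based sequence reward, a GRPO‑style group‑normalized advantage, and per‑position weights that favor coarse SID tokens. In this sketch the reward weights `w_click`/`w_order` and the geometric `decay` schedule are hypothetical placeholders, not the paper's Token‑Position Marginal Advantage formula; only the within‑group normalization follows the standard GRPO recipe.

```python
import math
from typing import List, Sequence


def composite_reward(clicked: bool, purchased: bool, w_click: float = 0.3, w_order: float = 1.0) -> float:
    """Hypothetical composite reward from logged behaviour (weights are placeholders)."""
    return w_click * float(clicked) + w_order * float(purchased)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: normalise each reward within its sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def position_weights(num_levels: int, decay: float = 0.5) -> List[float]:
    """Hypothetical decaying weights (mean 1): early, coarse SID tokens get more credit."""
    raw = [decay ** i for i in range(num_levels)]
    scale = num_levels / sum(raw)
    return [w * scale for w in raw]


def token_advantages(seq_advantage: float, num_levels: int, decay: float = 0.5) -> List[float]:
    """Spread one sequence-level advantage over the SID tokens of that sequence."""
    return [seq_advantage * w for w in position_weights(num_levels, decay)]


# Three sampled SID sequences for one query: purchased, clicked only, ignored.
rewards = [composite_reward(True, True), composite_reward(True, False), composite_reward(False, False)]
advantages = group_relative_advantages(rewards)
per_token = [token_advantages(adv, num_levels=3) for adv in advantages]
```

Normalizing the weights to a mean of 1 keeps the total credit per sequence unchanged while shifting it toward the first, coarsest tokens.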

Experimental Evaluation

Offline Results

On a 30k‑query test set (≈30 000 PVs, 7 229 orders), V2 achieved Order HR@10 = 0.2314 and Click HR@10 = 0.2568, a 2.68% absolute HR lift over V1. Ablation showed that self‑distillation contributed the largest single gain (Order HR@10 +1.17%, Click HR@10 +1.67%). Adding R‑Drop, FGM, and focal loss yielded synergistic improvements, reaching Order HR@10 = 0.2214 and Click HR@10 = 0.2471.
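For reference, a hit rate at k is usually computed as the fraction of requests whose top‑k list contains at least one ground‑truth item. The sketch below assumes that per‑request definition, with clicked items as targets for Click HR@10 and purchased items for Order HR@10; the paper's exact counting unit (query, PV, or order) may differ.

```python
from typing import Hashable, Sequence, Set


def hit_rate_at_k(
    ranked: Sequence[Sequence[Hashable]],
    targets: Sequence[Set[Hashable]],
    k: int = 10,
) -> float:
    """Fraction of requests whose top-k results contain at least one target item."""
    if len(ranked) != len(targets):
        raise ValueError("ranked and targets must have the same length")
    hits = sum(1 for items, gold in zip(ranked, targets) if gold & set(items[:k]))
    return hits / len(ranked)


ranked = [["a", "b"], ["c", "d"], ["e"]]
clicked = [{"b"}, {"x"}, {"e"}]
print(hit_rate_at_k(ranked, clicked, k=1))   # only the third request hits at k=1
print(hit_rate_at_k(ranked, clicked, k=10))  # first and third requests hit
```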

Online A/B Test Results

Deploying OneSearch‑V2 on the full platform increased overall CTR by 3.98%, buyer count by 2.07%, and order volume by 2.11%, with no increase in latency. Staged deployments of the RAG, Reasoning, and Full variants showed monotonic improvements.

Human Evaluation

Manual assessment of 3 200 query‑item pairs reported +1.37 % page quality, +0.55 % item relevance, and +1.65 % query‑item relevance.

In‑Depth Analysis

User/Query Frequency and Cold‑Start

V2 outperformed V1 across all user activity tiers, query frequencies, and item popularity levels. Gains were most pronounced for low‑activity users and cold‑start items, turning V1’s inverted‑U CTR curve into a balanced U‑shape.

Industry‑Level CTR Gains

Average CTR uplift across all verticals was 3.98%. Categories with rich but ambiguous titles (apparel, cosmetics, hardware) saw the largest relative improvements, confirming the effectiveness of CoT‑driven semantic enhancement.

CoT Keyword Coverage

Online pipelines continuously refresh the CoT keyword repository. Coverage of complex queries rose steadily during March 2026, ensuring stable self‑distillation updates.

Relevance‑Conversion Trade‑off

TPMA‑GRPO balances relevance rewards with post‑hoc conversion signals. The RAG‑only variant improved relevance but lagged in conversion; the full V2 model achieved a higher CTCVR (0.242% vs. 0.231%) by internalizing reasoning rather than merely boosting relevance scores.
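To make the CTCVR gap concrete, the arithmetic below compares the two reported values. It assumes the usual definition of CTCVR (conversions per impression, i.e. clicked‑and‑converted over exposed) and that the comparison of interest is a relative lift; the helper names are illustrative.

```python
def ctcvr(orders: int, impressions: int) -> float:
    """Click-through conversion rate under the usual definition: orders per impression."""
    return orders / impressions


def absolute_lift(treatment: float, control: float) -> float:
    """Difference in the metric's own units (here: percentage points)."""
    return treatment - control


def relative_lift(treatment: float, control: float) -> float:
    """Change relative to the control value."""
    return (treatment - control) / control


# Reported CTCVR values, both in percent: full V2 vs. the RAG-only variant.
print(f"{absolute_lift(0.242, 0.231):.3f} pp")  # 0.011 pp
print(f"{relative_lift(0.242, 0.231):.2%}")     # 4.76%
```

A 0.011‑percentage‑point gap looks small in absolute terms but corresponds to roughly a 4.8% relative improvement in conversions per impression.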

TPMA Flexibility in Promotions

During the 3.18 shopping festival, TPMA was used to give higher relevance rewards to emerging merchants, successfully elevating their rankings and demonstrating real‑time, business‑driven reward adjustment.
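A business‑driven adjustment of this kind could be expressed as a small campaign config applied on top of the base relevance reward. This is a hypothetical sketch: `PromotionConfig`, its field names, and the multiplicative boost are illustrative assumptions, not the mechanism described in the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PromotionConfig:
    """Hypothetical campaign config; field names and defaults are illustrative."""
    boosted_merchants: FrozenSet[str]
    relevance_multiplier: float = 1.5


def adjusted_relevance_reward(base: float, merchant_id: str, cfg: PromotionConfig) -> float:
    """Scale the relevance reward for items from merchants flagged by the campaign."""
    if merchant_id in cfg.boosted_merchants:
        return base * cfg.relevance_multiplier
    return base


cfg = PromotionConfig(boosted_merchants=frozenset({"merchant_new_01"}))
print(adjusted_relevance_reward(0.4, "merchant_new_01", cfg))  # boosted
print(adjusted_relevance_reward(0.4, "merchant_big_99", cfg))  # unchanged
```

Keeping the boost in a config object, rather than in the model, is what would allow it to be switched on and off for a specific campaign window.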

Future Directions

Beyond‑log training strategies for long‑tail queries with scarce interaction data.

Unified SID encoding that jointly represents text, images, video, and live‑stream content.

Agentic search systems with efficient online learning that preserve latency guarantees.

Paper and Code

Full technical details are available in the arXiv paper: https://arxiv.org/abs/2603.24422

Implementation can be explored at the GitHub repository: https://github.com/benchen4395/onesearch-family

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by Kuaishou Tech, the official Kuaishou tech account, providing real‑time updates on the latest Kuaishou technology practices.
