
Generative Recommendation with DPO Alignment for JD Alliance Advertising: Multi‑Objective Optimization and Online Results

The paper presents a generative recommendation framework for JD Alliance advertising that combines semantic‑ID modeling, large‑model pre‑training and fine‑tuning, and Direct Preference Optimization (including Softmax‑DPO and β‑DPO) to jointly boost click‑through and conversion rates, achieving +0.6% UCTR and +8% UCVR in online tests while outlining future multi‑objective extensions.

JD Retail Technology

This article describes how large generative recommendation models are applied in JD Alliance advertising to improve online UCTR (click‑through rate) and UCVR (conversion rate). By leveraging a DPO‑based alignment paradigm, the authors aim to boost conversion while preserving click performance.

It first reviews existing generative recommendation methods and provides background on DPO (Direct Preference Optimization) alignment techniques.

01. From Traditional Recommendation to Generative Recommendation – The shift simplifies the pipeline and relies on the strong generalization and stability of large models. A related survey article is referenced for further reading.

02. How Generative Recommendation Works – A discrete index system (semantic IDs) for items is built using an RQ‑VAE framework. User behavior sequences are represented as ordered semantic IDs and fed into an autoregressive model. Auxiliary tasks (e.g., ID‑to‑item and item‑to‑ID prediction) are added. Experiments show recall improvements, especially in sparse‑data scenarios, when using SFT or pre‑train + SFT.
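The semantic‑ID construction described above can be sketched as residual quantization in the spirit of RQ‑VAE: each level encodes the residual left by the previous level, so every item maps to a short, coarse‑to‑fine tuple of discrete codes. The dimensions, codebook sizes, and function names below are illustrative, not JD's production configuration:

```python
import numpy as np

def semantic_ids(item_emb, codebooks):
    """RQ-style residual quantization: at each level, snap the current
    residual to its nearest code word, then pass the leftover residual
    to the next level. Returns one discrete ID per level."""
    residual = item_emb.astype(np.float64)
    ids = []
    for codebook in codebooks:                       # one codebook per level
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest code word
        ids.append(idx)
        residual = residual - codebook[idx]          # quantization residual
    return tuple(ids)

# Toy setup: 3 levels, 8 code words per level, 4-dim item embeddings.
rng = np.random.default_rng(0)
books = [rng.normal(size=(8, 4)) for _ in range(3)]
item = rng.normal(size=4)
print(semantic_ids(item, books))   # a 3-token semantic ID for this item
```

A user's behavior sequence then becomes the concatenation of these ID tuples in chronological order, which the autoregressive model consumes and extends token by token.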

03. Multi‑Objective Optimization with DPO – For CPS advertising, the goal is to increase conversion while also encouraging clicks. The model is first pre‑trained on massive click data, then fine‑tuned (SFT) on recent user actions. Alignment is performed using DPO, with background on PPO in InstructGPT, Bradley‑Terry, and Plackett‑Luce models. Extensions such as Softmax‑DPO (handling multiple negatives) and β‑DPO (dynamic β adjustment) are introduced.
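The standard DPO objective underlying this alignment step is compact enough to write down directly. A minimal sketch for one (chosen, rejected) pair, assuming sequence‑level log‑probabilities under the policy and the frozen reference (SFT) model are already available; the function name and β value are illustrative:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the beta-scaled
    gap between the chosen and rejected policy/reference log-ratios.
    logp_w / logp_l: chosen and rejected log-probs under the policy;
    ref_logp_*: the same quantities under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already prefers the chosen item relative to the reference,
# the margin is positive and the loss falls below log(2).
print(dpo_loss(-2.0, -5.0, -3.0, -4.0))
```

At margin zero (policy identical to the reference) the loss is exactly log 2, which makes it easy to sanity‑check an implementation.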

Data Construction – Three pairwise preference‑data construction schemes are explored (the scheme names were elided in the source). Offline hit@1 results show slight decreases in click metrics but notable gains in conversion metrics.
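One plausible way to build such pairwise data from interaction logs is to rank a user's items by action strength (e.g., order > click > exposure) and pair stronger against weaker actions. The event schema and ranking below are hypothetical, not JD's actual log format:

```python
def build_pairs(events):
    """Emit (user, chosen, rejected) preference pairs: within each user's
    items, any item that drew a stronger action beats any item that drew
    a weaker one. Action strengths here are an illustrative ordering."""
    strength = {"exposure": 0, "click": 1, "order": 2}
    by_user = {}
    for user, item, action in events:
        by_user.setdefault(user, []).append((item, strength[action]))
    pairs = []
    for user, items in by_user.items():
        for it_a, s_a in items:
            for it_b, s_b in items:
                if s_a > s_b:                        # stronger beats weaker
                    pairs.append((user, it_a, it_b))
    return pairs

logs = [("u1", "i1", "order"), ("u1", "i2", "click"), ("u1", "i3", "exposure")]
print(build_pairs(logs))
# → [('u1', 'i1', 'i2'), ('u1', 'i1', 'i3'), ('u1', 'i2', 'i3')]
```

Which action levels are paired against which is exactly the design choice the three schemes differ on, and it controls the click‑versus‑conversion trade‑off seen in the offline results.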

Objective Function Adjustments – Multiple negatives are incorporated via Softmax‑DPO, and β‑DPO is applied to mitigate sensitivity to the β hyper‑parameter. Comparative offline results indicate modest improvements or trade‑offs relative to the baseline DPO model.
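Both adjustments have simple functional forms. Below is a sketch of a Softmax‑DPO‑style loss over one chosen item and several negatives (which reduces to standard DPO with a single negative), plus an illustrative per‑batch β rescaling in the spirit of β‑DPO; the exact scaling and update rule are assumptions, not the papers' precise formulations:

```python
import math

def softmax_dpo_loss(logr_w, logr_negs, beta=0.1):
    """Softmax-DPO sketch: logr_* are policy-minus-reference log-ratios
    log(pi/pi_ref) for the chosen item and each negative. With one
    negative this is exactly the standard DPO pairwise loss."""
    z = sum(math.exp(beta * (lr - logr_w)) for lr in logr_negs)
    return -math.log(1.0 / (1.0 + z))    # -log sigmoid(-log z)

def dynamic_beta(margin, beta0=0.1, alpha=0.5, avg_margin=1.0):
    """beta-DPO-style sketch: scale beta by how a pair's reward margin
    compares with a running average, so confident pairs get a larger
    beta and noisy pairs a smaller one. alpha is illustrative."""
    return beta0 * (1.0 + alpha * (margin - avg_margin))
```

Adding negatives can only increase the loss for a fixed chosen item, which matches the intuition that more rejected alternatives make the preference constraint stricter.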

04. Online Performance – A small‑traffic A/B test demonstrates a +0.6% uplift in UCTR and an +8.0% uplift in UCVR, confirming the practical effectiveness of the alignment approach.

05. Future Plans – The authors plan to further explore DPO variants, alternative multi‑objective optimization methods (e.g., MRPO), and modeling of multi‑scenario, multi‑behavior contexts within the generative recommendation paradigm.

References – A list of cited papers covering audio codecs, instruction‑tuned language models, Direct Preference Optimization, Bradley‑Terry, Plackett‑Luce, Softmax‑DPO, β‑DPO, and related works.

Tags: advertising · large language models · multi-objective optimization · DPO · generative recommendation
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
