Multi‑Agent Reinforcement Learning Based Full‑Chain Computation Allocation (MaRCA) for Advertising Systems
MaRCA, a multi‑agent reinforcement‑learning framework, allocates compute across JD’s advertising playback chain by jointly estimating user value, resource consumption, and action outcomes while adapting dynamically to real‑time load, achieving a roughly 15% lift in ad consumption without additional compute resources.
This article introduces MaRCA (Multi‑Agent Reinforcement Learning Computation Allocation), a full‑chain computation scheduling solution to the problem of maximizing commercial value under large traffic fluctuations, heterogeneous request values, and limited compute resources. MaRCA builds modules for user‑value estimation, compute‑resource estimation, action‑value estimation, and load‑aware decision making, and models the upstream‑downstream cooperation of the advertising playback chain as a multi‑agent reinforcement learning (MARL) problem. Centralized training with distributed execution reduces system risk while significantly increasing ad revenue, advancing compute scheduling from a generic tool to core intelligent infrastructure for recommendation systems.
Background – In JD’s out‑of‑site advertising, billions of user requests per day must be processed within sub‑hundred‑millisecond latency constraints. Traffic volume varies by time slot, and request values differ across media platforms and user groups. Most requests generate no revenue, requiring fine‑grained compute allocation that favors high‑value traffic.
Problem Modeling – The state space S includes user features, traffic features, and IDC information. The action space A consists of three types of compute actions: link‑selection, switch‑type, and queue‑type. The reward R(s, a) = Q(s, a) − λ·C(s, a) balances ad consumption Q(s, a) against compute consumption C(s, a), with λ acting as a load‑balancing factor. The resulting budget‑constrained optimization is formulated as a linear program and solved via Lagrangian duality, which reduces the global compute constraint to an independent per‑request decision rule.
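Under this formulation, each request can simply pick the action that maximizes Q(s, a) − λ·C(s, a). The sketch below illustrates that per‑request rule with toy value/cost estimators; the diminishing‑returns curve, queue sizes, and function names are illustrative assumptions, not MaRCA’s actual models.

```python
def best_action(state, actions, q_est, c_est, lam):
    """Pick the action maximizing estimated ad value minus
    lambda-weighted compute cost: argmax_a Q(s,a) - lam * C(s,a)."""
    return max(actions, key=lambda a: q_est(state, a) - lam * c_est(state, a))

# Toy estimators: ad value saturates with retrieval depth, cost grows linearly.
actions = [100, 200, 400, 800]                   # e.g. candidate queue sizes
q_est = lambda s, a: s * (1 - 0.5 ** (a / 200))  # diminishing returns in depth
c_est = lambda s, a: a / 1000.0                  # compute cost linear in depth

# Abundant compute (small lambda) favors deep queues; scarce compute (large
# lambda) pushes the same request toward cheaper, shallower actions.
assert best_action(1.0, actions, q_est, c_est, lam=0.1) == 800
assert best_action(1.0, actions, q_est, c_est, lam=2.0) == 200
```

Because λ is shared across all requests, tightening it during overload shifts every request’s decision toward cheaper actions at once, which is what makes the single scalar an effective global throttle.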
Overall Architecture – MaRCA comprises four modules:
User‑Value Estimation : predicts per‑request ad revenue using a Deep & Cross Network (DCN) enhanced with Poisson loss and value‑bucket segmentation.
Compute‑Resource Estimation : predicts compute consumption for each action combination via a two‑stage approach (action‑result prediction with DCN+MMoE, followed by queue‑type regression using monotonic polynomial fitting).
Action‑Value Estimation : estimates ad consumption for each action combination, handling the collaborative relationship between recall and ranking agents using Adaptive Weighted Ensemble DRQN and a Mixing Network.
Load‑Aware Decision Module : perceives real‑time CPU load and elastic‑degradation status, adjusts λ via feedback, and selects the optimal action set under load constraints.
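The load‑aware λ adjustment in the last module can be pictured as a feedback loop: when observed CPU load exceeds a target, λ rises so the reward R(s, a) = Q(s, a) − λ·C(s, a) penalizes expensive actions more heavily, and it falls again when headroom returns. A minimal sketch, assuming a multiplicative controller whose gain, target, and bounds are made‑up values (the article does not specify MaRCA’s exact update rule):

```python
def update_lambda(lam, cpu_load, target=0.6, gain=0.5,
                  lam_min=1e-3, lam_max=100.0):
    """Multiplicative feedback: lambda grows when load is above target,
    shrinks when below, clipped to a safe operating range."""
    lam *= 1.0 + gain * (cpu_load - target)
    return min(max(lam, lam_min), lam_max)

lam = 1.0
for load in (0.9, 0.9, 0.9):      # sustained overload -> lambda climbs
    lam = update_lambda(lam, load)
assert lam > 1.0                  # expensive actions now penalized more

assert update_lambda(1.0, cpu_load=0.3) < 1.0  # headroom -> lambda relaxes
```

A multiplicative update keeps λ positive by construction and reacts proportionally at any scale, which is why it is a common choice for this kind of budget controller.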
The Adaptive Weighted Ensemble DRQN combines DRQN (GRU‑based Q‑network) with multiple Q‑heads weighted by prediction error, while the Mixing Network aggregates individual agent Q‑values into a global Q total using a Softplus‑based monotonic architecture.
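The monotonicity property behind the Mixing Network can be seen in a few lines: if the mixing weights applied to per‑agent Q‑values are passed through Softplus, they are strictly positive, so the global Q total can never decrease when any single agent’s Q increases. The sketch below strips away the state‑conditioned hypernetwork and layer structure of a real mixing network and keeps only that sign argument; the weight and Q values are arbitrary examples.

```python
import math

def softplus(x):
    # softplus(x) = ln(1 + e^x) > 0 for all x -- this positivity is what
    # enforces monotonicity of the mixture in each agent's Q-value
    return math.log1p(math.exp(x))

def mix(agent_qs, raw_ws, bias=0.0):
    """Q_total as a Softplus-weighted sum of per-agent Q-values."""
    return sum(softplus(w) * q for w, q in zip(raw_ws, agent_qs)) + bias

raw_ws = [-1.2, 0.3, 2.0]   # unconstrained weights (any sign is allowed)
qs = [1.0, 2.0, 0.5]        # e.g. recall / ranking agent Q-values
qs_up = [2.0, 2.0, 0.5]     # first agent's Q improves

# Raising any individual agent's Q strictly raises the global Q_total.
assert mix(qs_up, raw_ws) > mix(qs, raw_ws)
```

Monotonicity matters because it lets each agent act greedily on its own Q at serving time while remaining consistent with the jointly trained global objective.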
Experimental Results – Offline evaluations and online A/B tests during JD’s 2024 618 and 11.11 promotions show a 14.93% increase in ad consumption with compute resources unchanged. The system also stayed stable through traffic spikes, with marked gains in reliability and scheduling intelligence.
Future Work – Plans include integrating Model Predictive Control (MPC) for proactive load‑aware decisions, expanding the action space with model selection and filtering strategies, and applying the framework to other recommendation pipelines with tight compute budgets.
JD Retail Technology