Multi‑Agent Reinforcement Learning Based Full‑Chain Computation Allocation (MaRCA) for Advertising Systems
MaRCA, a multi‑agent reinforcement‑learning framework, allocates compute across JD’s advertising playback chain by jointly estimating user value, resource consumption, and action outcomes while adapting dynamically to real‑time load, achieving a roughly 15% lift in ad consumption without additional compute resources.
This article introduces MaRCA (Multi‑Agent Reinforcement Learning Computation Allocation), a full‑chain computation scheduling solution to the problem of maximizing commercial value under large traffic fluctuations, heterogeneous request values, and limited compute resources. MaRCA builds modules for user‑value estimation, compute‑resource estimation, action‑value estimation, and load‑aware decision making, and models the upstream‑downstream cooperation of the advertising playback chain as a multi‑agent reinforcement learning (MARL) problem. Centralized training with distributed execution reduces system risk while significantly increasing ad revenue, advancing compute scheduling from a generic tool to core intelligent infrastructure for recommendation systems.
Background – In JD’s out‑of‑site advertising, billions of user requests per day must be processed within sub‑hundred‑millisecond latency constraints. Traffic volume varies by time slot, and request values differ across media platforms and user groups. Most requests generate no revenue, requiring fine‑grained compute allocation that favors high‑value traffic.
Problem Modeling – The state space S includes user features, traffic features, and IDC information. The action space A consists of three types of compute actions: link‑selection, switch‑type, and queue‑type. The reward R(s, a) = Q(s, a) − λ·C(s, a) balances ad consumption Q(s, a) against compute consumption C(s, a), with λ acting as a load‑balancing factor. The resulting budget‑constrained optimization is formulated as a linear program and solved via Lagrangian duality, which reduces the global compute constraint to an independent per‑request decision rule.
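Under this formulation, each request can simply pick the action that maximizes Q(s, a) − λ·C(s, a). The sketch below illustrates that per‑request rule with toy value/cost estimators; the diminishing‑returns curve, queue sizes, and function names are illustrative assumptions, not MaRCA’s actual models.

```python
def best_action(state, actions, q_est, c_est, lam):
    """Pick the action maximizing estimated ad value minus
    lambda-weighted compute cost: argmax_a Q(s,a) - lam * C(s,a)."""
    return max(actions, key=lambda a: q_est(state, a) - lam * c_est(state, a))

# Toy estimators: ad value saturates with retrieval depth, cost grows linearly.
actions = [100, 200, 400, 800]                   # e.g. candidate queue sizes
q_est = lambda s, a: s * (1 - 0.5 ** (a / 200))  # diminishing returns in depth
c_est = lambda s, a: a / 1000.0                  # compute cost linear in depth

# Abundant compute (small lambda) favors deep queues; scarce compute (large
# lambda) pushes the same request toward cheaper, shallower actions.
assert best_action(1.0, actions, q_est, c_est, lam=0.1) == 800
assert best_action(1.0, actions, q_est, c_est, lam=2.0) == 200
```

Because λ is shared across all requests, tightening it during overload shifts every request’s decision toward cheaper actions at once, which is what makes the single scalar an effective global throttle.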
Overall Architecture – MaRCA comprises four modules:
User‑Value Estimation : predicts per‑request ad revenue using a Deep & Cross Network (DCN) enhanced with Poisson loss and value‑bucket segmentation.
Compute‑Resource Estimation : predicts compute consumption for each action combination via a two‑stage approach (action‑result prediction with DCN+MMoE, followed by queue‑type regression using monotonic polynomial fitting).
Action‑Value Estimation : estimates ad consumption for each action combination, handling the collaborative relationship between recall and ranking agents using Adaptive Weighted Ensemble DRQN and a Mixing Network.
Load‑Aware Decision Module : perceives real‑time CPU load and elastic‑degradation status, adjusts λ via feedback, and selects the optimal action set under load constraints.
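The load‑aware λ adjustment in the last module can be pictured as a feedback loop: when observed CPU load exceeds a target, λ rises so the reward R(s, a) = Q(s, a) − λ·C(s, a) penalizes expensive actions more heavily, and it falls again when headroom returns. A minimal sketch, assuming a multiplicative controller whose gain, target, and bounds are made‑up values (the article does not specify MaRCA’s exact update rule):

```python
def update_lambda(lam, cpu_load, target=0.6, gain=0.5,
                  lam_min=1e-3, lam_max=100.0):
    """Multiplicative feedback: lambda grows when load is above target,
    shrinks when below, clipped to a safe operating range."""
    lam *= 1.0 + gain * (cpu_load - target)
    return min(max(lam, lam_min), lam_max)

lam = 1.0
for load in (0.9, 0.9, 0.9):      # sustained overload -> lambda climbs
    lam = update_lambda(lam, load)
assert lam > 1.0                  # expensive actions now penalized more

assert update_lambda(1.0, cpu_load=0.3) < 1.0  # headroom -> lambda relaxes
```

A multiplicative update keeps λ positive by construction and reacts proportionally at any scale, which is why it is a common choice for this kind of budget controller.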
The Adaptive Weighted Ensemble DRQN combines DRQN (GRU‑based Q‑network) with multiple Q‑heads weighted by prediction error, while the Mixing Network aggregates individual agent Q‑values into a global Q total using a Softplus‑based monotonic architecture.
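The monotonicity property behind the Mixing Network can be seen in a few lines: if the mixing weights applied to per‑agent Q‑values are passed through Softplus, they are strictly positive, so the global Q total can never decrease when any single agent’s Q increases. The sketch below strips away the state‑conditioned hypernetwork and layer structure of a real mixing network and keeps only that sign argument; the weight and Q values are arbitrary examples.

```python
import math

def softplus(x):
    # softplus(x) = ln(1 + e^x) > 0 for all x -- this positivity is what
    # enforces monotonicity of the mixture in each agent's Q-value
    return math.log1p(math.exp(x))

def mix(agent_qs, raw_ws, bias=0.0):
    """Q_total as a Softplus-weighted sum of per-agent Q-values."""
    return sum(softplus(w) * q for w, q in zip(raw_ws, agent_qs)) + bias

raw_ws = [-1.2, 0.3, 2.0]   # unconstrained weights (any sign is allowed)
qs = [1.0, 2.0, 0.5]        # e.g. recall / ranking agent Q-values
qs_up = [2.0, 2.0, 0.5]     # first agent's Q improves

# Raising any individual agent's Q strictly raises the global Q_total.
assert mix(qs_up, raw_ws) > mix(qs, raw_ws)
```

Monotonicity matters because it lets each agent act greedily on its own Q at serving time while remaining consistent with the jointly trained global objective.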
Experimental Results – Offline evaluations and online A/B tests during JD’s 2024 618 and 11.11 promotions show a 14.93% increase in ad consumption with compute resources unchanged. The system also stayed stable through traffic spikes, with marked gains in reliability and scheduling intelligence.
Future Work – Plans include integrating Model Predictive Control (MPC) for proactive load‑aware decisions, expanding the action space with model selection and filtering strategies, and applying the framework to other recommendation pipelines with tight compute budgets.
JD Retail Technology