
MaRCA: Multi‑Agent Reinforcement Learning Computation Allocation for Full‑Chain Ad Serving

This article presents MaRCA, a multi‑agent reinforcement learning framework that allocates computation resources across the full ad‑serving chain. By jointly modeling user value, compute consumption, and action rewards, it enables fine‑grained tilting of compute toward high‑quality traffic and achieves significant business gains under strict latency constraints.

JD Tech Talk

Background

As full‑chain optimization of site‑wide search advertising enters the deep‑water stage, the marginal benefit of simply adding machines is diminishing. Online ad services handle billions of user requests daily under sub‑hundred‑millisecond latency constraints, which makes compute allocation under limited resources a critical challenge, especially given traffic volatility and large disparities in traffic value.

Problem Modeling

Definition: At time t, the system state is s_t, and each module m has a load constraint C_m. The goal is to select an action combination a_t that maximizes the reward R(s_t, a_t) while keeping compute consumption C(s_t, a_t) within the load constraints.

State space (S) : user features, traffic features, IDC information, etc.

Action space (A) : three categories of full‑chain compute actions – link‑selection, switch, and queue decisions.

Action value : expected ad consumption for a given state‑action pair.

Compute consumption : estimated CPU usage for the action.

Action reward: R(s,a) = Q(s,a) − λ·C(s,a), where λ is a Lagrange multiplier that balances value against compute cost.

The optimization problem is formulated as a constrained linear program maximizing total reward across all states and actions.
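Assembling the definitions above, the program can be written out as follows. This is a reconstruction from the surrounding definitions, not a formula quoted from the paper: x_{s,a} is an assumed decision variable (the allocation of action combination a to state s), and C_m(s,a) denotes the consumption that the pair (s,a) imposes on module m.

```latex
\max_{x}\;\sum_{s}\sum_{a} x_{s,a}\,R(s,a)
\quad\text{s.t.}\quad
\sum_{s}\sum_{a} x_{s,a}\,C_m(s,a)\le C_m\;\;\forall m,
\qquad
\sum_{a} x_{s,a}=1,\quad x_{s,a}\ge 0.
```

With R(s,a) = Q(s,a) − λ·C(s,a), the load constraints are absorbed into the objective, which is what lets the decision module steer loads by adjusting λ alone.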

Overall Solution – MaRCA

MaRCA (Multi‑Agent Reinforcement Learning Computation Allocation) builds four estimation modules and a load‑aware decision module, leveraging the physical ownership of machines to model upstream‑downstream collaboration as a multi‑agent RL problem. Centralized training and distributed execution allow stable cluster operation while maximizing overall revenue.

Module Decomposition

User‑Value Estimation : predicts per‑request ad revenue using a Deep Cross Network (DCN) with Poisson loss and value bucketing.
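The Poisson loss used for value estimation can be sketched in a few lines of NumPy. This is a minimal illustration, not MaRCA code; the function name and toy data are made up, and the DCN itself is omitted:

```python
import numpy as np

def poisson_nll(y_true, log_rate):
    """Poisson negative log-likelihood, dropping the constant log(y!) term.

    The model predicts log_rate = log(lambda); loss = lambda - y * log(lambda).
    Predicting in log space keeps the rate positive without an explicit clamp.
    """
    rate = np.exp(log_rate)
    return np.mean(rate - y_true * log_rate)

# Toy check: the loss is lowest when the predicted rate matches the label mean.
y = np.array([2.0, 3.0, 4.0])
best = poisson_nll(y, np.log(np.full(3, y.mean())))   # rate = 3 everywhere
worse = poisson_nll(y, np.log(np.full(3, 10.0)))      # badly overestimated rate
```

Because ad revenue per request is non‑negative and heavy‑tailed, a Poisson likelihood is a natural fit, and the value‑bucketing step mentioned above can then discretize the predicted rate.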

Compute‑Estimation : predicts compute consumption for each action combination via a two‑stage approach – (1) request‑level result prediction (DCN+MMoE) and (2) queue‑type consumption estimation using measurement, monotonic polynomial regression, and similar techniques for switch‑type and link‑selection actions.
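The monotonic polynomial regression in stage (2) can be sketched by constraining polynomial coefficients to be non‑negative, which guarantees a non‑decreasing fit on x ≥ 0. The fitting routine, toy queue lengths, and CPU costs below are illustrative assumptions, not the production estimator:

```python
import numpy as np

def fit_monotone_poly(x, y, degree=2, steps=20000):
    """Fit y ~ sum_k c_k x^k with c_k >= 0 via projected gradient descent.

    Non-negative coefficients make the polynomial non-decreasing for x >= 0,
    matching the prior that more work in a queue never costs less CPU.
    """
    A = np.vander(x, degree + 1, increasing=True)   # columns: x^0, x^1, ...
    lr = 1.0 / np.linalg.norm(A, 2) ** 2            # safe step for least squares
    c = np.zeros(degree + 1)
    for _ in range(steps):
        c = np.maximum(c - lr * (A.T @ (A @ c - y)), 0.0)  # project onto c >= 0
    return c, A

x = np.array([1.0, 2.0, 3.0, 4.0])    # toy queue truncation lengths
y = np.array([1.1, 2.3, 3.2, 4.4])    # measured CPU cost (illustrative)
c, A = fit_monotone_poly(x, y)
pred = A @ c
```

The same shape constraint is what makes measurement‑based calibration safe: noisy samples can never produce a curve where serving more candidates appears cheaper.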

Action‑Value Estimation : employs a multi‑agent DRQN with adaptive weighted ensemble of K Q‑heads, handling partial observability in recall agents and providing stable value estimates.

Load‑Aware Decision : combines the estimated user value, compute consumption, and action value; monitors real‑time CPU load, elastic downgrade status, and system pressure; adjusts the compute‑balance factor λ via feedback to keep module loads near target C_m .
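The feedback adjustment of λ described above can be sketched as a simple proportional controller: raise λ when a module runs hot (penalizing compute more heavily), lower it when there is headroom. The step size and load readings are hypothetical; the production controller is not specified in this detail:

```python
def update_lambda(lam, observed_load, target_load, eta=0.05, lam_min=0.0):
    """One feedback step on the compute-balance factor.

    Positive load error (load above target C_m) increases lambda, shrinking
    R(s,a) = Q(s,a) - lambda * C(s,a) for compute-heavy actions.
    """
    lam = lam + eta * (observed_load - target_load) / target_load
    return max(lam, lam_min)   # lambda stays non-negative

lam = 1.0
for load in [0.95, 0.92, 0.88, 0.85]:   # module cooling toward a 0.80 target
    lam = update_lambda(lam, load, 0.80)
```

Because every module load enters only through λ, the decision itself stays a cheap per‑request argmax over R(s,a), which matters under the latency budget discussed in the background.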

Key Techniques

Adaptive Weighted Ensemble DRQN : integrates multiple Q‑heads weighted by prediction error to improve stability in partially observable environments.
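The weighting of the K Q‑heads can be sketched as follows. The article only says heads are weighted by prediction error; the inverse‑error softmax below is one plausible realization, and the head values and errors are toy numbers:

```python
import numpy as np

def ensemble_q(q_heads, head_errors, temperature=1.0):
    """Combine K Q-heads; heads with lower recent TD error get higher weight.

    q_heads:     shape (K, num_actions)
    head_errors: shape (K,), e.g. a running mean of each head's TD error
    """
    w = np.exp(-np.asarray(head_errors) / temperature)
    w = w / w.sum()                       # normalized ensemble weights
    return w @ np.asarray(q_heads), w

q_heads = np.array([[1.0, 2.0],
                    [1.2, 1.8],
                    [5.0, 0.0]])          # K=3 heads, 2 actions
errors = np.array([0.1, 0.2, 3.0])        # the third head has been unreliable
q, w = ensemble_q(q_heads, errors)
```

Down‑weighting high‑error heads damps the value spikes a single head can produce under partial observability, which is the stability property the technique is after.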

Mixing Network : aggregates individual agent Q‑values into a global joint Q using a monotonic Softplus‑based network, enabling cooperative decision making without communication overhead.
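The monotonicity trick behind the mixing network can be shown in a tiny NumPy sketch: passing raw mixing weights through Softplus keeps them strictly positive, so the joint Q is non‑decreasing in every agent's Q and each agent can greedily maximize its own head. The raw weights here stand in for a hypernetwork output, which is an assumption about the architecture:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))   # smooth, strictly positive map

def mix(agent_qs, w_raw, b):
    """Joint Q as a monotone combination of per-agent Q-values.

    Softplus weights guarantee dQ_tot/dQ_i > 0 for every agent i, so the
    per-agent argmax and the joint argmax agree (the QMIX-style property).
    """
    w = softplus(np.asarray(w_raw))
    return float(np.dot(w, agent_qs) + b)

qs_a = np.array([1.0, 2.0])
qs_b = np.array([1.5, 2.5])      # every agent's Q increased
w_raw = np.array([-0.3, 0.7])    # raw weights, e.g. from a hypernetwork (assumed)
q_tot_a = mix(qs_a, w_raw, b=0.1)
q_tot_b = mix(qs_b, w_raw, b=0.1)
```

This factorization is what allows "cooperative decision making without communication overhead": agents act on local Q‑values at serving time, while the mixer is only needed during centralized training.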

Experimental Results

Offline and online tests on JD's advertising platform show that MaRCA improves ad consumption by +14.93% while keeping total system resources unchanged. The reliability and intelligence of the ad‑serving system are markedly enhanced, mitigating traffic spikes during peak periods and large‑scale promotions.

Future Outlook

Planned enhancements include model‑predictive‑control‑based load‑aware agents for proactive λ adjustment, expansion of the action space with model selection and filtering strategies, and broader application of the framework to other recommendation pipelines facing tight compute budgets.

Tags: load balancing, reinforcement learning, multi-agent, AI optimization, ad serving, computation allocation
Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.