Multi-Agent Auto-bidding (MAAB): A Framework for Distributed Automatic Bidding in Online Advertising
The paper introduces MAAB, a scalable multi‑agent reinforcement‑learning framework for online ad bidding that uses temperature‑regularized credit assignment, adaptive threshold agents, and mean‑field clustering to balance individual advertiser utility, platform revenue, and overall social welfare in competitive auction environments.
In online advertising, automated bidding has become essential for advertisers to optimize performance metrics under budget constraints. Existing work mainly treats the problem from a single-agent perspective, ignoring the interactions among multiple bidding agents. This paper studies automated bidding from a distributed multi‑agent system viewpoint and proposes a general Multi‑Agent Auto‑bidding framework (MAAB) to learn bidding strategies.
Background. Traditional ad auctions require manual bidding for each impression, which is infeasible at scale. Platforms provide automated bidding services (e.g., Google AdWords, Baidu Fengchao, Alibaba Super Recommendation) that let advertisers specify goals and constraints while the platform optimizes bids. However, multiple automated bidding agents compete for the same impressions, creating complex cooperative‑competitive dynamics that affect both individual advertiser utility and overall system welfare.
Challenges. (1) Jointly optimizing individual utility and social welfare is difficult because pure competition can lead to monopolistic outcomes, while pure cooperation may cause collusive low‑price bidding that harms platform revenue. (2) Designing reward functions that balance these forces is non‑trivial, especially in real‑world simulators. (3) Scaling to millions of advertisers is computationally prohibitive if each advertiser is modeled as an independent agent.
Proposed Method (MAAB). MAAB consists of three key components:
Temperature‑Regularized Credit Assignment (TRCA): a reward‑allocation mechanism that distributes auction rewards to agents using a softmax weighting with a temperature parameter. Adjusting the temperature smoothly interpolates between competitive and cooperative regimes.
Threshold agents: auxiliary agents that set personalized bidding thresholds for each bidding agent during training, encouraging higher platform revenue while preventing collusive low‑price behavior.
Mean‑field approximation: advertisers with the same optimization goal are clustered into a single “average” bidding agent, drastically reducing the number of agents and enabling efficient training on industrial‑scale data.
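The clustering idea behind the mean‑field approximation can be sketched as follows. This is an illustrative simplification, not the paper's implementation: advertiser IDs, objective labels, and the grouping function are hypothetical.

```python
from collections import defaultdict

def cluster_by_objective(advertisers):
    """Group advertisers that share an optimization goal into one cluster,
    each represented by a single 'average' bidding agent during training.

    `advertisers` is a list of (advertiser_id, objective) pairs; the field
    names and objective labels are illustrative assumptions.
    """
    groups = defaultdict(list)
    for adv_id, objective in advertisers:
        groups[objective].append(adv_id)
    return dict(groups)

ads = [("adv_1", "click"), ("adv_2", "conversion"),
       ("adv_3", "click"), ("adv_4", "cart_addition")]
clusters = cluster_by_objective(ads)
# One shared policy is trained per cluster rather than per advertiser,
# shrinking millions of agents down to a handful.
```

Training one policy per cluster (here three policies instead of four, and in production a handful instead of millions) is what makes the approach tractable at industrial scale.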
Foundational Concepts. The paper formalizes the bidding problem as a partially observable Markov decision process (POMDP) where each agent’s action is a bid, observations include remaining budget, estimated value of the impression, and remaining bidding opportunities, and rewards are derived from auction outcomes (e.g., GSP payments). Independent Learner (IL) methods such as DQN are discussed, and a key distinction is drawn between competitive independent learners that optimize only their own reward (CM‑IL) and cooperative ones that optimize the total reward across agents (CO‑IL).
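The POMDP observation and the two independent-learner reward schemes can be made concrete with a minimal sketch. The field and function names below are illustrative assumptions, not the paper's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class BidObservation:
    """Per-agent observation in the POMDP formulation (field names assumed)."""
    remaining_budget: float
    estimated_value: float       # predicted value of the current impression
    remaining_opportunities: int # bidding chances left in the episode

def il_reward(own_reward, all_rewards, scheme):
    """Reward signal for an independent learner under the two schemes.

    CM-IL (competitive): each agent optimizes only its own auction reward.
    CO-IL (cooperative): each agent optimizes the total reward of all agents.
    """
    if scheme == "CM-IL":
        return own_reward
    if scheme == "CO-IL":
        return sum(all_rewards)
    raise ValueError(f"unknown scheme: {scheme}")
```

Under CM‑IL an agent that wins nothing sees zero reward regardless of others' gains, which drives the monopoly dynamics discussed next; under CO‑IL every agent shares the same total signal, which invites collusive low bids.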
Behavior Analysis of IL. Experiments with two agents under CM‑IL and CO‑IL reveal that pure competition can cause a “monopoly” effect (one agent dominates value capture) and lower social welfare, while pure cooperation improves welfare but reduces platform revenue due to collusive low bids.
Method Details. TRCA assigns each agent a weight \(w_i = \frac{\exp(r_i/\tau)}{\sum_j \exp(r_j/\tau)}\) and scales its reward accordingly. Threshold agents receive platform revenue as reward and are trained adversarially against bidding agents; a bar‑gate mechanism ensures bids respect the thresholds. The mean‑field approach aggregates agents by objective (e.g., click‑optimization, conversion‑optimization, cart‑addition) and learns a shared policy that is later individualized using each agent’s own state.
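The TRCA weighting above is a softmax with temperature \(\tau\), so its interpolation between regimes can be checked directly. A minimal sketch (the function name and reward values are assumptions for illustration):

```python
import numpy as np

def trca_weights(rewards, tau):
    """Temperature-regularized credit-assignment weights:
    w_i = exp(r_i / tau) / sum_j exp(r_j / tau).

    Small tau -> near winner-take-all (competitive regime);
    large tau -> near uniform sharing (cooperative regime).
    """
    r = np.asarray(rewards, dtype=float)
    z = (r - r.max()) / tau      # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

rewards = [1.0, 2.0, 4.0]
competitive = trca_weights(rewards, tau=0.01)   # mass concentrates on the top agent
cooperative = trca_weights(rewards, tau=100.0)  # close to uniform 1/3 shares
```

Subtracting the maximum before exponentiating leaves the weights unchanged (the factor cancels in the ratio) but avoids overflow at small temperatures.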
Experiments. Offline simulations on a 6‑hour Alibaba ad log (≈700k impressions, ~400 advertisers per impression) compare MAAB against baselines: manual bidding (MSB), single‑agent DQN (DQN‑S), CM‑IL, CO‑IL, and variants without threshold agents or with fixed thresholds. Results show MAAB achieves higher social welfare than CM‑IL and higher platform revenue than CO‑IL. Online A/B tests further confirm that MAAB improves normalized welfare with limited revenue loss. Ablation studies demonstrate the effectiveness of TRCA (by varying the temperature) and threshold agents (fixed vs. adaptive).
Conclusion. MAAB provides a scalable multi‑agent reinforcement learning framework for large‑scale automated bidding, balancing advertiser utility, platform revenue, and overall social welfare through TRCA, adaptive threshold agents, and mean‑field clustering. Future work includes dynamic temperature adjustment and improved reward designs for threshold agents.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.