Applying Reinforcement Learning to Optimize Advertising Bidding ROI

This article presents a comprehensive overview of using reinforcement learning to solve advertising bidding ROI optimization, covering historical foundations, methodological reasoning, system architecture, practical implementation details, challenges, evaluation metrics, and recommended algorithms for real‑world ad placement scenarios.

IEG Growth Platform Technology Team

Author: biglongyuan, Tencent IEG Growth Platform Application Researcher

The article shares the team’s practice of using reinforcement learning to solve advertising ROI optimization, divided into two parts: tracing the origins and theory (knowledge‑action integration) and the practical reinforcement‑learning‑based bidding implementation.

Quick read:

Part 1 – Tracing origins, knowledge‑action integration

Part 2 – Reinforcement‑learning‑based bidding practice

In solving business problems, two steps are crucial: first, clarify cause and effect to understand the essence of the problem; second, go beyond algorithmic theory to understand both what works and why it works, so that knowledge and practice are truly unified.

1. Tracing Origins, Knowledge‑Action Integration

(This section references the reinforcement‑learning survey http://www.icdai.org/ibbb/2019/ID-0004.pdf )

In 1911, Thorndike proposed the law of effect: behaviors that produce pleasant outcomes become stronger, while unpleasant outcomes weaken the behavior‑context association.

Early reinforcement learning followed two independent lines: trial‑and‑error learning from animal psychology and optimal control using value functions and dynamic programming. These later merged, with temporal‑difference (TD) learning bridging the gap.

Dynamic programming solves stochastic optimal control but suffers from the “curse of dimensionality”. Modern reinforcement learning revived trial‑and‑error ideas, culminating in algorithms such as Actor‑Critic architectures (Barto, Sutton & Anderson, 1983), TD(λ) (Sutton, 1988), Q‑learning (Watkins, 1989), and later deep RL methods such as DQN (DeepMind, 2015), which some view as steps toward general AI.

For more application cases, see the Zhihu article https://zhuanlan.zhihu.com/p/78191585 .

2. Reinforcement‑Learning‑Based Advertising Bidding Practice

Initially, other methods were tried; reinforcement learning proved most effective for the relatively closed, optimization‑focused bidding scenario.

Problem insight: given a state, optimize the bid of each creative to maximize overall delivery efficiency.

State definition includes ad slot, creative, user dimensions (pCTR, pCVR), and estimated exposure value (eCPM).
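A minimal sketch of how such a state might be assembled. The field names, the CPC‑pricing assumption, and the formula eCPM = pCTR × bid × 1000 are our illustrative choices, not the team's actual feature code:

```python
from dataclasses import dataclass

@dataclass
class BidState:
    """Illustrative bidding state: ad slot, creative, and user dimensions."""
    slot_id: int        # ad slot dimension
    creative_id: int    # creative dimension
    pctr: float         # predicted click-through rate (user dimension)
    pcvr: float         # predicted conversion rate (user dimension)

    def ecpm(self, bid_cpc: float) -> float:
        """Estimated exposure value for a CPC bid: eCPM = pCTR * bid * 1000."""
        return self.pctr * bid_cpc * 1000.0

    def as_vector(self, bid_cpc: float) -> list:
        """Flat feature vector an RL agent could consume."""
        return [self.slot_id, self.creative_id, self.pctr, self.pcvr,
                self.ecpm(bid_cpc)]

state = BidState(slot_id=3, creative_id=17, pctr=0.02, pcvr=0.05)
print(state.ecpm(5.0))  # 0.02 * 5.0 * 1000 -> 100.0
```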

2.1 System Architecture

The ROI‑optimizing solution consists of an offline computation module and an online reinforcement‑learning module.

Offline computation has two parts:

1. Comprehensive media efficiency evaluation (general ROI, shallow‑layer efficiency vs. exploration).

2. Reinforcement learning: under budget constraints, adjust bids per creative state to maximize overall performance.

2.2 Why Use Reinforcement Learning

The problem is a constrained linear program with many ad slots and frequent creative turnover, making dual‑method solutions complex.
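As a toy illustration of the linear‑programming view (a deliberate simplification, not the team's formulation): with per‑creative values vᵢ, costs cᵢ, and budget B, maximizing Σ vᵢxᵢ subject to Σ cᵢxᵢ ≤ B and 0 ≤ xᵢ ≤ 1 is a fractional knapsack, which the greedy value/cost‑ratio rule solves exactly:

```python
def fractional_knapsack(values, costs, budget):
    """Exact greedy solution to the LP relaxation: allocate creatives in
    decreasing value/cost ratio, splitting the one that exhausts the budget."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / costs[i], reverse=True)
    x = [0.0] * len(values)
    remaining = budget
    for i in order:
        if remaining <= 0:
            break
        take = min(1.0, remaining / costs[i])  # fraction of creative i funded
        x[i] = take
        remaining -= take * costs[i]
    return x

# Three creatives with values [60, 100, 120] and costs [10, 20, 30]
alloc = fractional_knapsack([60, 100, 120], [10, 20, 30], budget=25)
```

With many ad slots and daily creative churn, each new creative changes the LP, which is why a dual-based solver becomes cumbersome in practice.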

The initial solution approximated the optimal point via geometric properties and importance sampling within the closed feasible region, but it could not incorporate user‑side information.

Reinforcement learning fits because the bidding setting satisfies the reward‑maximization and Markov assumptions that reinforcement learning requires.

2.3 Practice

Practical experience shows that algorithm choice must be validated experimentally; good theory does not guarantee success without proper reward design and environment modeling.

Reinforcement‑learning algorithm families explored include A3C, PPO, TD3, TRPO, DDPG, DQN, and DDQN.

2.4.1 Evaluation of Effects

Following the seminal DeepMind DQN paper, evaluation methods include average episode return and average Q‑value across a fixed set of state‑action pairs; the latter provides smoother trends.
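The fixed‑state Q‑value metric from the DQN paper can be sketched as follows; the lambda Q‑function below is a stand‑in for a trained network, and the states and actions are placeholders:

```python
def average_max_q(q_fn, fixed_states, actions):
    """Average of max_a Q(s, a) over a fixed, held-out set of states.
    Tracked across training, this curve is smoother than episode return."""
    total = 0.0
    for s in fixed_states:
        total += max(q_fn(s, a) for a in actions)
    return total / len(fixed_states)

# Toy Q-function standing in for the trained network (an assumption)
q = lambda s, a: s * 0.1 + a
avg_q = average_max_q(q, fixed_states=[1, 2, 3], actions=[0, 1])
```

The key design point is that the state set is collected once (e.g. by a random policy before training) and then frozen, so the metric is comparable across checkpoints.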

Performance fluctuations arise because RL agents overfit to recent states, causing episode‑to‑episode score variance.

A 2020 paper proposes four criteria for evaluation metrics (scientific rigor, usability, reproducibility, and resistance to exploitation) and introduces performance percentiles, though their practical usefulness proved limited.

2.4.2 Training Environment Construction

Inspired by DeepMind’s DQN work on Atari, a training environment should start simple, efficient, and extensible, and then be refined to simulate the business scenario accurately and to support reward shaping.
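A minimal log‑replay environment skeleton in the familiar reset/step convention might look like this. The record fields, auction rules, and reward shape are all our illustrative assumptions, not the team's simulator:

```python
class BiddingEnv:
    """Minimal log-replay bidding environment (illustrative skeleton).
    One step = one logged auction; the episode ends when the log or
    the budget is exhausted."""

    def __init__(self, log, budget):
        self.log = log        # records like {"pctr": ..., "value": ..., "floor": ...}
        self.budget = budget

    def reset(self):
        self.t = 0
        self.spent = 0.0
        return self._obs()

    def _obs(self):
        rec = self.log[self.t]
        return (rec["pctr"], rec["floor"], self.budget - self.spent)

    def step(self, bid):
        rec = self.log[self.t]
        win = bid >= rec["floor"] and self.spent + bid <= self.budget
        # Shaped reward: expected impression value minus the price paid
        reward = rec["pctr"] * rec["value"] - bid if win else 0.0
        if win:
            self.spent += bid
        self.t += 1
        done = self.t >= len(self.log) or self.spent >= self.budget
        return (self._obs() if not done else None), reward, done
```

A production environment would add budget pacing, second‑price charging, and richer reward‑shaping terms; this skeleton only fixes the interface.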

Data preprocessing involves clustering exposure logs by ad category, removing noise, and applying importance sampling based on eCPM.
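The eCPM‑based importance‑sampling step can be sketched with standard‑library weighted sampling; the record fields and the weight‑proportional‑to‑eCPM rule are our assumptions:

```python
import random

def sample_by_ecpm(records, k, seed=42):
    """Sample training records with probability proportional to eCPM,
    so the replayed environment emphasizes high-value exposures."""
    rng = random.Random(seed)
    weights = [r["ecpm"] for r in records]
    return rng.choices(records, weights=weights, k=k)

logs = [{"id": i, "ecpm": e} for i, e in enumerate([1.0, 5.0, 20.0, 0.5])]
batch = sample_by_ecpm(logs, k=3)   # biased toward the 20.0-eCPM record
```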

2.4.3 Algorithm Hyper‑parameter Tuning

Common algorithms (DQN, TD3, PPO, SAC) are chosen based on action space: D3QN for discrete, TD3 for continuous (or PPO/SAC if tuning expertise is limited).

Key hyper‑parameters include network width/depth, dropout rate, batch‑normalization handling, replay memory size, batch size, update frequency, and discount factor.
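These knobs can be gathered into a single configuration object. Every value below is a generic starting point of our choosing, not the team's tuned setting:

```python
# Placeholder values only -- common starting points, not tuned settings.
dqn_config = {
    "hidden_sizes": (256, 256),    # network width / depth
    "dropout": 0.0,                # often disabled for RL value networks
    "batch_norm": False,           # batch statistics interact poorly with replay
    "replay_size": 100_000,        # replay memory capacity
    "batch_size": 64,
    "target_update_every": 1_000,  # steps between target-network syncs
    "gamma": 0.99,                 # discount factor
}
```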

Exploration strategies: epsilon‑greedy for discrete actions and noisy‑networks for continuous actions.
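A sketch of the discrete case, plus additive Gaussian action noise as a simpler stand‑in for the continuous case (the noisy‑networks approach mentioned above perturbs network weights instead; the substitution is ours):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Discrete exploration: random action with probability epsilon,
    otherwise the greedy argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def gaussian_noise_action(mu_action, sigma, low, high, rng):
    """Continuous exploration via additive Gaussian noise (as in DDPG/TD3),
    clipped to the valid bid range."""
    noisy = mu_action + rng.gauss(0.0, sigma)
    return min(max(noisy, low), high)

rng = random.Random(0)
a = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)   # greedy -> index 1
b = gaussian_noise_action(2.5, sigma=0.2, low=0.0, high=5.0, rng=rng)
```

In practice epsilon is annealed from ~1.0 toward a small floor over training, and the noise scale sigma is similarly decayed.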

Online results showed an approximate 80% ROI improvement.

Tags: advertising, machine learning, reinforcement learning, online advertising, ad bidding, ROI optimization
Written by IEG Growth Platform Technology Team

Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.