
Bayesian A/B Testing with PyMC3: A Practical Guide

This article introduces the motivation and logic behind A/B testing, highlights common misunderstandings of p‑values, and demonstrates how Bayesian A/B testing using PyMC3 can provide intuitive probability statements about which variant performs better, complete with Python code examples.

DataFunTalk

A/B testing is a data-driven way to choose between two options. This article explains the motivation behind A/B tests and the pitfalls of p‑values, then introduces a Bayesian approach that sidesteps the usual p‑value misinterpretations.

Imagine an online store with 10,000 daily visitors and a conversion rate of about 1%. By randomly assigning roughly half the visitors to see a blue button (control) and the rest to see a red button (variant), you can measure which button yields the higher conversion rate.

The assignment must be genuinely random; otherwise confounding factors such as gender or day of the week could bias the results (for example, if most weekend visitors happened to see the red button).
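A minimal sketch of random assignment, assuming each visitor is assigned independently with probability 0.5 via NumPy's `Generator` API. The helper name `assign_variants` and the seed are hypothetical. Because each visitor is an independent coin flip, the two groups end up close to, but usually not exactly, equal in size.

```python
import numpy as np


def assign_variants(n_visitors: int, seed: int = 42) -> np.ndarray:
    """Hypothetical helper: randomly assign each visitor to 'blue' or 'red'.

    Each visitor is assigned independently with probability 0.5, so the
    assignment cannot correlate with gender, day of the week, or any other
    visitor attribute.
    """
    rng = np.random.default_rng(seed)
    return np.where(rng.random(n_visitors) < 0.5, "blue", "red")


groups = assign_variants(10_000)
n_blue = int((groups == "blue").sum())
n_red = int((groups == "red").sum())
print(n_blue, n_red)  # close to, but usually not exactly, 5,000 each
```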

Preparing the A/B test

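Assume you have collected data for 10,000 visitors: 4,800 saw the blue button and 5,200 saw the red one. Purchases are encoded as 1 and non‑purchases as 0. The following Python code simulates such data, with true conversion rates of 1% for blue and 1.2% for red: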

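import numpy as np

np.random.seed(0)  # fix the seed so the simulation is reproducible
# 1 = purchase, 0 = no purchase
blue_conversions = np.random.binomial(1, 0.01, size=4800)   # control: true rate 1%
red_conversions = np.random.binomial(1, 0.012, size=5200)   # variant: true rate 1.2%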

Printing the simulated arrays shows mostly zeros, reflecting the low conversion rates.

print(blue_conversions)
# output: [0 0 0 ... 0 0 0]
print(red_conversions)
# output: [0 0 0 ... 0 0 0]

Calculating the observed conversion rates:

print(f'Blue: {blue_conversions.mean():.3%}')
print(f'Red: {red_conversions.mean():.3%}')
# output:
# Blue: 0.854%
# Red: 1.135%

These numbers suggest the red button may be better, but with only about 100 purchases in total, we need a statistical test to judge whether a gap of this size could plausibly arise by chance.
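A back-of-the-envelope sketch of why chance cannot be ruled out by eye. It assumes the counts behind the reported rates are 41 purchases out of 4,800 blue visitors (0.854%) and 59 out of 5,200 red visitors (1.135%), which is what those percentages imply, and uses the usual normal approximation for the standard error of a difference in proportions.

```python
import math

# Counts implied by the reported rates (assumption: 41/4800 = 0.854%, 59/5200 = 1.135%)
blue_purchases, blue_visitors = 41, 4800
red_purchases, red_visitors = 59, 5200

p_blue = blue_purchases / blue_visitors
p_red = red_purchases / red_visitors
diff = p_red - p_blue

# Standard error of the difference between two independent proportions
se = math.sqrt(
    p_blue * (1 - p_blue) / blue_visitors + p_red * (1 - p_red) / red_visitors
)

print(f"difference: {diff:.3%}, standard error: {se:.3%}, ratio: {diff / se:.2f}")
```

The difference comes out at only about 1.4 standard errors, while a one-sided test at the 5% level would need roughly 1.645, which foreshadows the inconclusive result below.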

Traditional (frequentist) approach

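A one‑sided Welch's t‑test via SciPy (testing whether blue's conversion rate is lower than red's) yields a p‑value of 7.8%. The `alternative` argument requires SciPy 1.6 or later: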

from scipy.stats import ttest_ind

# One-sided Welch's t-test (unequal variances); H1: blue's mean is less than red's
result = ttest_ind(blue_conversions, red_conversions, equal_var=False, alternative="less")
print(f'p-value: {result.pvalue:.1%}')
# output: p-value: 7.8%

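Because 7.8% > 5%, we fail to reject the null hypothesis at the conventional 5% level, and the result is inconclusive. The p‑value is also easy to misread: it is not the probability that the null hypothesis is true, nor one minus the probability that red is better. It is the probability of seeing a difference at least this large in red's favor if both buttons actually converted at the same rate. Failing to reject the null hypothesis also does not show that the two buttons perform equally well.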
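The definition of a p‑value can be made concrete by simulation. This sketch assumes the counts behind the reported rates are 41 purchases of 4,800 blue visitors and 59 of 5,200 red visitors, pretends both buttons share the pooled 1% rate (the null hypothesis), and counts how often red beats blue by at least the observed margin. It is a parametric simulation, not the Welch's t‑test used above, so its answer will be close to, but not exactly, 7.8%.

```python
import numpy as np

rng = np.random.default_rng(0)

# Implied observed data (assumption, derived from the reported rates)
blue_n, red_n = 4800, 5200
observed_diff = 59 / red_n - 41 / blue_n

# Null hypothesis: both buttons convert at the same (pooled) rate
pooled_rate = (41 + 59) / (blue_n + red_n)

# Simulate many experiments in which the null hypothesis is true
n_sims = 200_000
blue_sim = rng.binomial(blue_n, pooled_rate, size=n_sims) / blue_n
red_sim = rng.binomial(red_n, pooled_rate, size=n_sims) / red_n

# One-sided p-value: share of null experiments at least as extreme as ours
p_value = np.mean(red_sim - blue_sim >= observed_diff)
print(f"simulated one-sided p-value: {p_value:.1%}")
```

Every simulated experiment here has equal true rates by construction, so the p‑value says nothing directly about how likely it is that red is better; that is the question the Bayesian approach below answers.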

Bayesian A/B testing advantages

Provides a direct probability that one variant is better than the other.

Requires only a generative model and Bayesian inference, not a suite of statistical tests.
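To illustrate what "a generative model" means here, this sketch writes the model's story as a forward simulation in NumPy, before any data are seen: draw a conversion rate for each button from a Beta(1, 99) prior, then draw each visitor's purchase as a Bernoulli trial with that rate. Bayesian inference then runs this story in reverse, from observed purchases back to plausible rates. The function name `simulate_experiment` is hypothetical.

```python
import numpy as np


def simulate_experiment(n_blue: int, n_red: int, rng: np.random.Generator):
    """Hypothetical helper: one draw from the prior generative model."""
    # Step 1: unknown conversion rates, drawn from the Beta(1, 99) prior
    blue_rate = rng.beta(1, 99)
    red_rate = rng.beta(1, 99)
    # Step 2: each visitor buys (1) or not (0) with probability equal to the rate
    blue = rng.binomial(1, blue_rate, size=n_blue)
    red = rng.binomial(1, red_rate, size=n_red)
    return blue_rate, red_rate, blue, red


rng = np.random.default_rng(1)
blue_rate, red_rate, blue, red = simulate_experiment(4800, 5200, rng)
print(f"true rates: {blue_rate:.3%} vs {red_rate:.3%}")
print(f"simulated rates: {blue.mean():.3%} vs {red.mean():.3%}")
```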

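Using PyMC3, we model each conversion rate with a Beta(1, 99) prior, whose mean of 1 / (1 + 99) = 1% matches the expected conversion rate, and each visitor's purchase decision with a Bernoulli likelihood: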

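import pymc3 as pm

with pm.Model():
    # Priors: both rates centered on 1% (Beta(1, 99) has mean 1 / (1 + 99))
    blue_rate = pm.Beta('blue_rate', 1, 99)
    red_rate = pm.Beta('red_rate', 1, 99)
    # Likelihoods: each visitor converts with probability equal to the rate
    blue_obs = pm.Bernoulli('blue_obs', blue_rate, observed=blue_conversions)
    red_obs = pm.Bernoulli('red_obs', red_rate, observed=red_conversions)
    # Draw posterior samples (NUTS by default) and return an ArviZ InferenceData
    trace = pm.sample(return_inferencedata=True)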
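Before trusting the model, it can help to check what the Beta(1, 99) prior actually says. This sketch uses `scipy.stats.beta`, which is not part of the original code, to print the prior mean and a central 95% prior interval.

```python
from scipy.stats import beta

prior = beta(1, 99)

# Mean of Beta(a, b) is a / (a + b) = 1 / 100
print(f"prior mean: {prior.mean():.2%}")

# Central 95% prior interval for the conversion rate
low, high = prior.ppf([0.025, 0.975])
print(f"95% prior interval: {low:.3%} to {high:.3%}")
```

The interval runs from a few hundredths of a percent to under 4%, so the prior rules out implausibly high conversion rates while staying wide relative to the gap between the buttons. Loosely, it carries the weight of about 100 pseudo-visitors, which is small next to roughly 5,000 real visitors per arm.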

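The posterior distributions are centered close to the observed rates of 0.854% for blue and 1.135% for red (which are also the maximum‑likelihood estimates), and each comes with a credible interval that quantifies the remaining uncertainty about the true rate.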
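Because the Beta prior is conjugate to the Bernoulli likelihood, this particular model also has a closed‑form posterior: Beta(1 + purchases, 99 + non‑purchases). The sketch below uses that fact to cross‑check the PyMC3 results without any sampling. It assumes the counts behind the reported rates are 41 purchases of 4,800 blue visitors and 59 of 5,200 red visitors; the helper name `posterior` is hypothetical.

```python
from scipy.stats import beta

PRIOR_A, PRIOR_B = 1, 99  # Beta(1, 99) prior, as in the PyMC3 model


def posterior(purchases: int, visitors: int):
    """Hypothetical helper: exact Beta posterior for a conversion rate."""
    return beta(PRIOR_A + purchases, PRIOR_B + visitors - purchases)


# Implied counts (assumption): 41/4800 for blue, 59/5200 for red
blue_post = posterior(41, 4800)
red_post = posterior(59, 5200)

for name, post in [("blue", blue_post), ("red", red_post)]:
    low, high = post.ppf([0.025, 0.975])
    print(f"{name}: posterior mean {post.mean():.3%}, "
          f"95% credible interval {low:.3%} to {high:.3%}")
```

The posterior means, 42/4,900 ≈ 0.857% and 60/5,300 ≈ 1.132%, sit slightly closer to the prior mean of 1% than the observed rates do, which is what the PyMC3 samples should approximate.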

To answer the key question—"What is the probability that the red variant is better?"—we compare posterior samples:

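# Posterior samples for each rate, shape (chains, draws)
blue_rate_samples = trace.posterior['blue_rate'].values
red_rate_samples = trace.posterior['red_rate'].values
# Share of paired posterior draws in which red's rate exceeds blue's
print(f'Probability that red is better: {(red_rate_samples > blue_rate_samples).mean():.1%}.')
# output (varies slightly between sampling runs): Probability that red is better: 91.7%.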

Under this model and prior, there is roughly a 92% chance that the red button outperforms the blue one, a clear and intuitive number for decision‑makers.
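The same probability can be approximated without PyMC3 by sampling directly from the closed‑form posteriors, since a Beta(1, 99) prior combined with Bernoulli data gives a Beta(1 + purchases, 99 + non‑purchases) posterior. This is a cross‑check under stated assumptions (implied counts of 41/4,800 for blue and 59/5,200 for red), not the article's method. It also shows how the same samples answer follow‑up questions, such as the expected relative uplift of red over blue.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Closed-form posteriors: Beta(1 + purchases, 99 + non-purchases)
blue_samples = rng.beta(1 + 41, 99 + 4800 - 41, size=n_samples)
red_samples = rng.beta(1 + 59, 99 + 5200 - 59, size=n_samples)

# Probability that red converts better than blue
prob_red_better = np.mean(red_samples > blue_samples)

# Relative uplift of red over blue (0.3 would mean 30% more conversions)
uplift = red_samples / blue_samples - 1

print(f"Probability that red is better: {prob_red_better:.1%}")
print(f"Expected relative uplift: {uplift.mean():.1%}")
```

With these assumptions the probability should come out close to the roughly 92% reported by PyMC3.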

Conclusion

A/B testing—whether classic or Bayesian—allows you to isolate the effect of a single change (e.g., button color) by randomizing users into control and treatment groups. Bayesian A/B testing avoids the confusing interpretation of p‑values and delivers a probability that directly answers business questions, all with relatively little Python code.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
