Common Pitfalls in AB Testing: Design and Analysis Issues
AB tests often fail because practitioners skip power analysis, peek at interim results, set up unreasonable null hypotheses, randomize at the wrong unit, overlook sample-ratio mismatches, choose misleading metrics, or fall into segmentation traps such as Simpson's paradox; any one of these can invalidate an experiment's conclusions.
AB testing is conceptually simple, but correctly applying it in real business scenarios is challenging. Practitioners need a deep understanding of both experimental techniques and business characteristics. This article reviews typical "pitfalls" in AB testing, focusing on experiment design and analysis.
1. Pitfalls in Experiment Design
1.1 Lack of Power Analysis
Power analysis determines the required sample size and is a fundamental step of AB testing. In practice it is often ignored for two reasons:
Missing basic parameters – The three key parameters are significance level (commonly 0.05), statistical power (commonly 0.8), and the minimum detectable effect. While the first two are standard, the minimum effect is highly subjective and varies across business goals, making consensus difficult.
Sample size not controlled by the experimenter – Sample size is usually dictated by business logic (e.g., traffic allocation and experiment duration) rather than by a prior power calculation.
In traffic‑based experiments, sample size depends on traffic proportion and experiment period. Business teams often set a maximum traffic share (e.g., ≤1%) and a fixed duration (e.g., two weeks), limiting the ability to adjust sample size.
In customer‑level experiments, the randomization unit is the customer, and the total eligible customer pool is often small, so increasing sample size by extending the experiment is impossible.
Without power analysis, the experiment may be underpowered: true effects go undetected in AB tests (Type II errors), and AA tests lose their value as a sanity check, undermining the reliability of the overall testing program.
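The sample-size calculation behind a power analysis can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal stdlib-only sketch; the 10% baseline conversion rate and 1-percentage-point minimum detectable effect are illustrative assumptions, not figures from the article.

```python
from statistics import NormalDist

def required_n_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Two-sided sample size per group for comparing two proportions
    (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for significance level
    z_beta = z.inv_cdf(power)           # critical value for desired power
    p_treat = p_base + mde
    var_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return ((z_alpha + z_beta) ** 2 * var_sum) / mde ** 2

# Illustrative: 10% baseline conversion, detect an absolute lift of 1pp
n = required_n_per_group(0.10, 0.01)
```

With these assumptions the formula calls for roughly fifteen thousand users per group, which makes it easy to see why a "≤1% of traffic for two weeks" constraint may simply be incompatible with the minimum detectable effect the business wants.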
1.2 “Peeking” Fallacy
“Peeking” — checking results before the experiment has collected its planned sample — violates the premise of fixed-horizon hypothesis testing. Stopping as soon as the p‑value drops below 0.05 inflates the Type I error rate: across many interim looks, the probability that the minimum observed p‑value dips below 0.05 at least once is substantially greater than 0.05.
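The inflation is easy to demonstrate by simulation. The sketch below, a stdlib-only assumption-laden illustration (normally distributed metric with unit variance, a z-test at every look, and the look schedule chosen arbitrarily), runs AA experiments where no true effect exists and stops at the first "significant" interim look.

```python
import random
from statistics import NormalDist

def peeked_false_positive_rate(n_exps=400, n_max=1000, peek_every=100,
                               alpha=0.05, seed=42):
    """Simulate AA tests (no true effect), peeking every `peek_every`
    observations; return the fraction declared 'significant' at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_exps):
        sum_a = sum_b = 0.0
        for i in range(1, n_max + 1):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if i % peek_every == 0:
                # z-statistic for the difference in means (unit variance known)
                z = (sum_a - sum_b) / (2 * i) ** 0.5
                if abs(z) > z_crit:
                    false_positives += 1
                    break
    return false_positives / n_exps
```

With ten looks per experiment, the realized false-positive rate lands well above the nominal 5%, even though each individual look uses the "correct" 0.05 threshold.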
1.3 Unreasonable Null Hypothesis
In practice, the null hypothesis may involve multiple parameters without proper adjustment. For example, Google’s famous “41 shades of blue” experiment required comparing many variants simultaneously; using a simple two‑sample t‑test would dramatically increase the chance of Type I errors. Solutions include using a joint hypothesis with a chi‑square test or applying multiple‑comparison corrections such as Bonferroni.
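The Bonferroni correction mentioned above is simple enough to sketch in a few lines: with m comparisons, each individual test is held to the threshold alpha/m. The p‑values below are invented for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test only if its p-value clears alpha / m,
    where m is the total number of comparisons."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# With 41 variants, a single comparison must clear 0.05 / 41 (about 0.0012),
# so a p-value of 0.01 that looks "significant" in isolation is not rejected.
decisions = bonferroni_reject([0.01, 0.0005, 0.20] + [0.5] * 38)
```

Bonferroni is conservative; less strict alternatives such as the Holm step-down procedure exist, but the core idea — tightening the per-test threshold as the number of comparisons grows — is the same.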
1.4 Inappropriate Randomization Unit
Using page view (PV) as the randomization unit is easy to implement but can be unsuitable in two scenarios:
Spill‑over effect – When the same user sees both treatment and control versions, heavy users are more likely to be exposed to the treatment, biasing results. Randomizing at the user or user‑group level mitigates this.
Novelty effect – New users react differently to UI changes than existing users. Randomizing by user allows targeting new users to avoid this bias.
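User-level randomization is commonly implemented by hashing a stable user identifier, so assignment is deterministic and a user never sees both variants. This is a sketch under assumptions — the salt, the bucket count, and the 50:50 split are all hypothetical choices, not details from the article.

```python
import hashlib

def assign_arm(user_id: str, salt: str = "exp-ui-2024",
               treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user: the same id always maps to the
    same arm, which avoids the spill-over problem of PV randomization."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"
```

Changing the salt per experiment re-shuffles the buckets, so the same user population can be reused across independent experiments without correlated assignments.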
2. Pitfalls in Experiment Analysis
2.1 Sample Ratio Mismatch (SRM)
SRM occurs when the observed proportion of traffic in treatment and control deviates from the intended split (e.g., 50:50). Studies show SRM appears in ~6% of Microsoft experiments and similarly in other tech firms.
Detection is straightforward: perform a chi‑square test on the observed group sizes. Example data:
Group       User Count   Ad Revenue
Treatment   15257        16
Control     15752        12
Although the revenue difference appears to favor the treatment, a chi‑square test on the user counts yields a p‑value of roughly 0.5%, indicating a severe SRM. Running a t‑test on the revenue metric without first resolving the mismatch would produce a misleading conclusion.
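The SRM check on the counts above can be sketched without any statistics library: for a 1-degree-of-freedom chi-square statistic, the tail probability equals that of a squared standard normal, so the stdlib's NormalDist suffices.

```python
from statistics import NormalDist

def srm_pvalue(n_treatment, n_control, expected_share=0.5):
    """Chi-square goodness-of-fit test (1 df) for a sample-ratio mismatch."""
    total = n_treatment + n_control
    exp_t = total * expected_share
    exp_c = total * (1 - expected_share)
    chi2 = ((n_treatment - exp_t) ** 2 / exp_t
            + (n_control - exp_c) ** 2 / exp_c)
    # For 1 df: P(X > chi2) = 2 * (1 - Phi(sqrt(chi2)))
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))

p = srm_pvalue(15257, 15752)  # the user counts from the table above
```

For the counts in the table this gives a p-value near 0.005, matching the 0.5% figure quoted in the text — far too small to attribute the imbalance to chance under an intended 50:50 split.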
A real Microsoft case showed SRM caused by the anti‑fraud system mistakenly filtering many treatment‑group users as bots. After correcting this, the experiment result reversed.
2.2 Inappropriate Metrics
A complete metric system typically includes:
1) Overall evaluation metric – long‑term business value (e.g., “north star” metric).
2) Guardrail metrics – protect key business and user‑experience indicators; an experiment that degrades a guardrail is rejected even if its primary metric improves.
3) Local metrics – reflect specific user behaviors.
Choosing the wrong metric leads to dilemmas: focusing solely on the overall metric may yield inconclusive results, while relying on local metrics can produce conflicting signals (e.g., CTR ↑ while CVR ↓).
2.3 Segmentation Errors (Simpson’s Paradox)
When the segmentation variable is correlated with treatment assignment (as in the classic Berkeley admissions case, where gender was correlated with the department applied to), aggregated results can contradict every subgroup's trend, leading to misleading conclusions. Proper randomization aims to eliminate exactly this kind of confounding.
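A small numerical example makes the paradox concrete. The counts below are illustrative (styled after the well-known kidney-stone textbook example, not data from any experiment in this article): the treatment wins inside every segment, yet loses on the pooled data because the two arms are exposed to the segments in very different proportions.

```python
# (successes, trials) per segment; the numbers are illustrative only
data = {
    "segment_1": {"treatment": (81, 87),   "control": (234, 270)},
    "segment_2": {"treatment": (192, 263), "control": (55, 80)},
}

# Treatment has the higher success rate inside each segment...
for arms in data.values():
    t_rate = arms["treatment"][0] / arms["treatment"][1]
    c_rate = arms["control"][0] / arms["control"][1]
    assert t_rate > c_rate

# ...yet pooling reverses the ranking, because segment sizes differ by arm
t_succ = sum(a["treatment"][0] for a in data.values())
t_total = sum(a["treatment"][1] for a in data.values())
c_succ = sum(a["control"][0] for a in data.values())
c_total = sum(a["control"][1] for a in data.values())
```

Here the pooled treatment rate (273/350 ≈ 0.78) falls below the pooled control rate (289/350 ≈ 0.83) even though the treatment is better in both segments — which is why segment-level analysis is only trustworthy when segment membership is independent of assignment.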
Conclusion
The article summarizes common errors in AB testing, ranging from statistical misuse to business‑logic misunderstandings and engineering issues. Any single error can invalidate experimental conclusions, and many more pitfalls exist beyond those listed here.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.