Common Pitfalls in AB Testing: Design and Analysis Issues
AB tests often fail because practitioners skip power analysis, peek at interim results, set up unreasonable null hypotheses, randomize at the wrong unit, overlook sample-ratio mismatches, choose misleading metrics, or fall into segmentation traps such as Simpson's paradox; any one of these can invalidate an experiment's conclusions.
AB testing is conceptually simple, but correctly applying it in real business scenarios is challenging. Practitioners need a deep understanding of both experimental techniques and business characteristics. This article reviews typical "pitfalls" in AB testing, focusing on experiment design and analysis.
1. Pitfalls in Experiment Design
1.1 Lack of Power Analysis
Power analysis determines the required sample size and is a fundamental step of AB testing. In practice it is often ignored for two reasons:
Missing basic parameters – The three key parameters are significance level (commonly 0.05), statistical power (commonly 0.8), and the minimum detectable effect. While the first two are standard, the minimum effect is highly subjective and varies across business goals, making consensus difficult.
Sample size not controlled by the experimenter – Sample size is usually dictated by business logic (e.g., traffic allocation and experiment duration) rather than by a prior power calculation.
In traffic‑based experiments, sample size depends on traffic proportion and experiment period. Business teams often set a maximum traffic share (e.g., ≤1%) and a fixed duration (e.g., two weeks), limiting the ability to adjust sample size.
In customer‑level experiments, the randomization unit is the customer, and the total eligible customer pool is often small, so increasing sample size by extending the experiment is impossible.
Without power analysis, the experiment may be underpowered: true effects go undetected in AB tests (Type II errors), and AA tests lose their value as a sanity check, undermining the reliability of the overall testing program.
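The sample-size calculation behind a power analysis can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal stdlib-only sketch; the 10% baseline conversion rate and 1-percentage-point minimum detectable effect are illustrative assumptions, not figures from the article.

```python
from statistics import NormalDist

def required_n_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Two-sided sample size per group for comparing two proportions
    (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for significance level
    z_beta = z.inv_cdf(power)           # critical value for desired power
    p_treat = p_base + mde
    var_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return ((z_alpha + z_beta) ** 2 * var_sum) / mde ** 2

# Illustrative: 10% baseline conversion, detect an absolute lift of 1pp
n = required_n_per_group(0.10, 0.01)
```

With these assumptions the formula calls for roughly fifteen thousand users per group, which makes it easy to see why a "≤1% of traffic for two weeks" constraint may simply be incompatible with the minimum detectable effect the business wants.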
1.2 “Peeking” Fallacy
“Peeking” — checking results before the experiment has collected its planned sample — violates the premise of fixed-horizon hypothesis testing. Stopping as soon as the p‑value drops below 0.05 inflates the Type I error rate: across many interim looks, the probability that the minimum observed p‑value dips below 0.05 at least once is substantially greater than 0.05.
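The inflation is easy to demonstrate by simulation. The sketch below, a stdlib-only assumption-laden illustration (normally distributed metric with unit variance, a z-test at every look, and the look schedule chosen arbitrarily), runs AA experiments where no true effect exists and stops at the first "significant" interim look.

```python
import random
from statistics import NormalDist

def peeked_false_positive_rate(n_exps=400, n_max=1000, peek_every=100,
                               alpha=0.05, seed=42):
    """Simulate AA tests (no true effect), peeking every `peek_every`
    observations; return the fraction declared 'significant' at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_exps):
        sum_a = sum_b = 0.0
        for i in range(1, n_max + 1):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if i % peek_every == 0:
                # z-statistic for the difference in means (unit variance known)
                z = (sum_a - sum_b) / (2 * i) ** 0.5
                if abs(z) > z_crit:
                    false_positives += 1
                    break
    return false_positives / n_exps
```

With ten looks per experiment, the realized false-positive rate lands well above the nominal 5%, even though each individual look uses the "correct" 0.05 threshold.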
1.3 Unreasonable Null Hypothesis
In practice, the null hypothesis may involve multiple parameters without proper adjustment. For example, Google’s famous “41 shades of blue” experiment required comparing many variants simultaneously; using a simple two‑sample t‑test would dramatically increase the chance of Type I errors. Solutions include using a joint hypothesis with a chi‑square test or applying multiple‑comparison corrections such as Bonferroni.
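The Bonferroni correction mentioned above is simple enough to sketch in a few lines: with m comparisons, each individual test is held to the threshold alpha/m. The p‑values below are invented for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test only if its p-value clears alpha / m,
    where m is the total number of comparisons."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# With 41 variants, a single comparison must clear 0.05 / 41 (about 0.0012),
# so a p-value of 0.01 that looks "significant" in isolation is not rejected.
decisions = bonferroni_reject([0.01, 0.0005, 0.20] + [0.5] * 38)
```

Bonferroni is conservative; less strict alternatives such as the Holm step-down procedure exist, but the core idea — tightening the per-test threshold as the number of comparisons grows — is the same.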
1.4 Inappropriate Randomization Unit
Using page view (PV) as the randomization unit is easy to implement but can be unsuitable in two scenarios:
Spill‑over effect – When the same user sees both treatment and control versions, heavy users are more likely to be exposed to the treatment, biasing results. Randomizing at the user or user‑group level mitigates this.
Novelty effect – New users react differently to UI changes than existing users. Randomizing by user allows targeting new users to avoid this bias.
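User-level randomization is commonly implemented by hashing a stable user identifier, so assignment is deterministic and a user never sees both variants. This is a sketch under assumptions — the salt, the bucket count, and the 50:50 split are all hypothetical choices, not details from the article.

```python
import hashlib

def assign_arm(user_id: str, salt: str = "exp-ui-2024",
               treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user: the same id always maps to the
    same arm, which avoids the spill-over problem of PV randomization."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"
```

Changing the salt per experiment re-shuffles the buckets, so the same user population can be reused across independent experiments without correlated assignments.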
2. Pitfalls in Experiment Analysis
2.1 Sample Ratio Mismatch (SRM)
SRM occurs when the observed proportion of traffic in treatment and control deviates from the intended split (e.g., 50:50). Studies show SRM appears in ~6% of Microsoft experiments and similarly in other tech firms.
Detection is straightforward: perform a chi‑square test on the observed group sizes. Example data:
Group       User Count   Ad Revenue
Treatment   15257        16
Control     15752        12
Although the revenue difference appears to favor the treatment, a chi‑square test on the user counts yields a p‑value of roughly 0.5%, indicating a severe SRM. Running a t‑test on the revenue metric without first resolving the mismatch would produce a misleading conclusion.
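The SRM check on the counts above can be sketched without any statistics library: for a 1-degree-of-freedom chi-square statistic, the tail probability equals that of a squared standard normal, so the stdlib's NormalDist suffices.

```python
from statistics import NormalDist

def srm_pvalue(n_treatment, n_control, expected_share=0.5):
    """Chi-square goodness-of-fit test (1 df) for a sample-ratio mismatch."""
    total = n_treatment + n_control
    exp_t = total * expected_share
    exp_c = total * (1 - expected_share)
    chi2 = ((n_treatment - exp_t) ** 2 / exp_t
            + (n_control - exp_c) ** 2 / exp_c)
    # For 1 df: P(X > chi2) = 2 * (1 - Phi(sqrt(chi2)))
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))

p = srm_pvalue(15257, 15752)  # the user counts from the table above
```

For the counts in the table this gives a p-value near 0.005, matching the 0.5% figure quoted in the text — far too small to attribute the imbalance to chance under an intended 50:50 split.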
A real Microsoft case showed SRM caused by the anti‑fraud system mistakenly filtering many treatment‑group users as bots. After correcting this, the experiment result reversed.
2.2 Inappropriate Metrics
A complete metric system typically includes:
1) Overall evaluation metric – long‑term business value (e.g., “north star” metric).
2) Guardrail metrics – protect key business and user‑experience indicators; an experiment that degrades a guardrail is rejected even if its primary metric improves.
3) Local metrics – reflect specific user behaviors.
Choosing the wrong metric leads to dilemmas: focusing solely on the overall metric may yield inconclusive results, while relying on local metrics can produce conflicting signals (e.g., CTR ↑ while CVR ↓).
2.3 Segmentation Errors (Simpson’s Paradox)
When the segmentation variable is correlated with treatment assignment (as in the classic Berkeley admissions case, where gender was correlated with the department applied to), aggregated results can contradict every subgroup's trend, leading to misleading conclusions. Proper randomization aims to eliminate exactly this kind of confounding.
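A small numerical example makes the paradox concrete. The counts below are illustrative (styled after the well-known kidney-stone textbook example, not data from any experiment in this article): the treatment wins inside every segment, yet loses on the pooled data because the two arms are exposed to the segments in very different proportions.

```python
# (successes, trials) per segment; the numbers are illustrative only
data = {
    "segment_1": {"treatment": (81, 87),   "control": (234, 270)},
    "segment_2": {"treatment": (192, 263), "control": (55, 80)},
}

# Treatment has the higher success rate inside each segment...
for arms in data.values():
    t_rate = arms["treatment"][0] / arms["treatment"][1]
    c_rate = arms["control"][0] / arms["control"][1]
    assert t_rate > c_rate

# ...yet pooling reverses the ranking, because segment sizes differ by arm
t_succ = sum(a["treatment"][0] for a in data.values())
t_total = sum(a["treatment"][1] for a in data.values())
c_succ = sum(a["control"][0] for a in data.values())
c_total = sum(a["control"][1] for a in data.values())
```

Here the pooled treatment rate (273/350 ≈ 0.78) falls below the pooled control rate (289/350 ≈ 0.83) even though the treatment is better in both segments — which is why segment-level analysis is only trustworthy when segment membership is independent of assignment.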
Conclusion
The article summarizes common errors in AB testing, ranging from statistical misuse to business‑logic misunderstandings and engineering issues. Any single error can invalidate experimental conclusions, and many more pitfalls exist beyond those listed here.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.