Improving Low-Response A/B Experiments via Propensity Score Matching and Instrumental Variable Methods
The paper tackles low-response A/B tests with instrumental-variable (IV) techniques and optimized propensity-score matching (PSM). IV methods recover the treatment effect for compliant users, and a refined PSM pipeline raises the measured lift from 0.049 pt to 0.288 pt, turning a previously non-significant result into a statistically significant one.
This article addresses the problem of low-response A/B experiments, in which only a small fraction of the users assigned to the treatment bucket are actually exposed to the treatment. The effect measured over the whole bucket is therefore diluted, which shrinks the apparent business impact and reduces statistical power.
It defines low-response experiments, explains the two types of response funnels—"experiment capability" (leakage due to engineering or design) and "user choice" (leakage due to user behavior)—and discusses how each affects business and statistical significance.
The core insight is that the incremental effect can be estimated more accurately by accounting for these funnels. Because bucket assignment is randomized, it serves as a valid instrument for actual exposure, so a simple instrumental-variable (IV) estimate recovers the treatment effect for the "experiment capability" funnel. That estimate is the local average treatment effect (LATE): it applies to compliers, the users who are exposed only because they were assigned to the treatment bucket, not to the whole population.
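As a minimal sketch of the basic IV idea, the ratio (Wald) form divides the diluted bucket-level lift by the share of users who were actually exposed. The function name `wald_iv_estimate`, the simulated data, and all numbers below are illustrative assumptions, not the article's implementation.

```python
import numpy as np

def wald_iv_estimate(z, d, y):
    """Wald (ratio) IV estimate.

    z: 0/1 random bucket assignment (the instrument)
    d: 0/1 flag, whether the user was actually exposed to the treatment
    y: outcome metric
    Returns (LATE estimate, bucket-level lift, difference in exposure rates).
    """
    z, d, y = (np.asarray(a, dtype=float) for a in (z, d, y))
    itt_y = y[z == 1].mean() - y[z == 0].mean()  # diluted intent-to-treat lift
    itt_d = d[z == 1].mean() - d[z == 0].mean()  # estimated share of compliers
    return itt_y / itt_d, itt_y, itt_d

# Simulated low-response experiment (hypothetical numbers).
rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, size=n)        # random bucket assignment
complier = rng.random(n) < 0.10       # only 10% of users can be reached
d = z * complier                      # control-bucket users are never exposed
true_effect = 2.0                     # effect on users who are exposed
y = rng.normal(10.0, 1.0, size=n) + true_effect * d

late, itt_y, itt_d = wald_iv_estimate(z, d, y)
print(f"bucket-level lift={itt_y:.3f}, response rate={itt_d:.3f}, LATE={late:.3f}")
```

The bucket-level lift is roughly the response rate times the effect on exposed users, which is why it looks small; dividing by the response rate undoes the dilution but leaves the relative noise of the estimate essentially unchanged.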
Two IV approaches are presented: (1) a basic IV that ignores covariates, which improves business significance by rescaling the diluted lift to the users actually exposed but does not enhance statistical efficiency; (2) an IV that incorporates covariates, which can increase efficiency if suitable covariates satisfying the exclusion restriction are available.
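One common way to bring covariates into an IV estimate is two-stage least squares (2SLS). The sketch below is an assumption about how that could look, not the article's code: `two_stage_least_squares` is a hypothetical helper, the covariate is assumed to be measured before the experiment, and the simulation parameters are invented for illustration.

```python
import numpy as np

def two_stage_least_squares(z, d, y, x=None):
    """2SLS estimate of the effect of exposure d on outcome y,
    instrumented by bucket assignment z, with optional covariates x
    included in both stages."""
    z, d, y = (np.asarray(a, dtype=float) for a in (z, d, y))
    n = len(y)
    if x is None:
        x = np.empty((n, 0))                    # no covariates
    else:
        x = np.asarray(x, dtype=float).reshape(n, -1)
    ones = np.ones((n, 1))
    # Stage 1: predict exposure from the instrument and the covariates.
    w1 = np.hstack([ones, z[:, None], x])
    d_hat = w1 @ np.linalg.lstsq(w1, d, rcond=None)[0]
    # Stage 2: regress the outcome on predicted exposure and the covariates.
    w2 = np.hstack([ones, d_hat[:, None], x])
    beta = np.linalg.lstsq(w2, y, rcond=None)[0]
    return beta[1]                              # coefficient on exposure

rng = np.random.default_rng(1)

def simulate(n=5000):
    z = rng.integers(0, 2, size=n)
    x = rng.normal(size=n)                      # pre-experiment covariate
    d = z * (rng.random(n) < 0.2)               # 20% of the treatment bucket is exposed
    y = 10.0 + 3.0 * x + 2.0 * d + rng.normal(size=n)
    return z, d, y, x

# Repeat the experiment to compare the spread of the two estimators.
plain, adjusted = [], []
for _ in range(200):
    z, d, y, x = simulate()
    plain.append(two_stage_least_squares(z, d, y))        # basic IV
    adjusted.append(two_stage_least_squares(z, d, y, x))  # IV with covariate
print("std without covariate:", np.std(plain))
print("std with covariate:   ", np.std(adjusted))
```

Without covariates the 2SLS estimate coincides with the simple ratio estimate. In this simulation the covariate explains much of the outcome variance, so the adjusted estimator varies less across repetitions; with a weakly predictive covariate the gain would be correspondingly small.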
Propensity score matching (PSM) is introduced as an alternative method. After estimating propensity scores (e.g., via logistic regression), users are matched using nearest‑neighbour, caliper/radius, or stratification techniques. The method relies on the Conditional Independence Assumption and Common Support, and it yields the average treatment effect for the matched subset of users.
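The matching step can be sketched in a generic setting as follows: fit a logistic model for the propensity score, then pair each treated user with the nearest control user within a caliper. The helpers `fit_logistic` and `psm_att`, the caliper value, and the simulated data are assumptions for illustration; the article's production pipeline may differ in every detail.

```python
import numpy as np

def fit_logistic(x, t, n_iter=25):
    """Logistic regression via Newton's method; returns weights (intercept first)."""
    xb = np.hstack([np.ones((len(t), 1)), x])
    w = np.zeros(xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))
        grad = xb.T @ (t - p)
        hess = (xb * (p * (1.0 - p))[:, None]).T @ xb
        w += np.linalg.solve(hess + 1e-8 * np.eye(len(w)), grad)
    return w

def psm_att(x, t, y, caliper=0.05):
    """1-nearest-neighbour matching (with replacement) on the propensity score.
    Returns (effect on matched treated users, share of treated users matched)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(t), -1)
    w = fit_logistic(x, t)
    score = 1.0 / (1.0 + np.exp(-(np.hstack([np.ones((len(t), 1)), x]) @ w)))
    treated = np.flatnonzero(t == 1)
    control = np.flatnonzero(t == 0)
    order = control[np.argsort(score[control])]  # control users sorted by score
    s_sorted = score[order]
    # For each treated user, compare the neighbours just below and just above.
    pos = np.searchsorted(s_sorted, score[treated])
    left = np.clip(pos - 1, 0, len(order) - 1)
    right = np.clip(pos, 0, len(order) - 1)
    use_left = (np.abs(s_sorted[left] - score[treated])
                <= np.abs(s_sorted[right] - score[treated]))
    nearest = np.where(use_left, left, right)
    keep = np.abs(s_sorted[nearest] - score[treated]) <= caliper  # common support
    att = np.mean(y[treated[keep]] - y[order[nearest[keep]]])
    return att, keep.mean()

# Simulated data (hypothetical): more active users are more likely to be treated.
rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)                            # activity covariate
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-(0.8 * x - 1.0)))).astype(float)
y = 5.0 + 2.0 * x + 1.0 * t + rng.normal(size=n)  # true effect is 1.0

naive = y[t == 1].mean() - y[t == 0].mean()
att, matched_share = psm_att(x, t, y)
print(f"naive difference={naive:.3f}, matched estimate={att:.3f}, "
      f"matched share={matched_share:.3f}")
```

Treated users whose nearest control lies outside the caliper are dropped, so the estimate describes only the matched subset. The naive difference in means is inflated here because treated users are more active to begin with, which is the kind of selection the matching is meant to remove.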
Empirical evaluation on simulated data and two real online experiments shows that standard PSM tends to over‑estimate the lift because high‑activity users are harder to match. An optimized PSM pipeline—dropping low‑propensity users and performing cluster‑based matching within the control group—reduces bias and improves statistical significance.
The optimized PSM results are compared with the baseline A/B analysis: the estimated lift increases from 0.049 pt (significant on 2 of 10 days) to 0.288 pt (significant on 9 of 10 days), turning a non-significant result into a significant one.
In summary, (1) basic IV estimation can partially resolve business insignificance; (2) adding covariates to IV can further improve statistical power when appropriate covariates exist; (3) optimized propensity‑score matching offers a practical, though slightly biased, solution that substantially enhances detection of incremental effects.
The proposed methods have been integrated into the internal evaluation platform, and the team invites collaboration for further improvements.
DaTaobao Tech
Official account of DaTaobao Technology