Understanding Online Experiments: Origins, Development, Types, and Applications
Online experiments are rooted in the biomedical randomized controlled trial. They have become essential for internet businesses pursuing data-driven growth: they support causal inference, quantify the value of a change, and control risk, using designs such as AB, ABn, AA, multivariate, and quasi-experimental tests.
In internet business, "growth" is an eternal theme. As the era of abundant traffic fades, achieving effective growth when the effect of each strategy is hard to observe directly has become a major challenge for online companies. Online experiments have been a crucial measurement tool since Google first applied experimental techniques to its products in 2000, and they are now indispensable for strategy validation, product iteration, algorithm optimization, and risk control.
"One accurate measurement is worth a thousand expert opinions." – Admiral Grace Hopper
Alibaba’s advertising division has accumulated extensive practice and technical expertise in online experiments. The following series will share this knowledge, covering topics such as "Understanding Online Experiments", "AB Testing under Traffic Splitting Framework", and "AB Testing under Offline Sampling Framework".
1. What Is an Online Experiment
1.1 Origin
The concept of AB testing originates from the randomized controlled trial (RCT) used in biomedical research to evaluate drug efficacy. RCTs randomize subjects into groups, apply different treatments, and compare outcomes, thereby minimizing bias and confounding factors. This statistical foundation makes RCTs the gold standard for causal inference.
1.2 Development
With the rise of the internet, the RCT methodology has been widely adopted for product optimization. After the traffic‑growth era, user and value growth slowed, making data‑driven growth essential. The internet makes experiments easier because:
Sample size: Online platforms can collect millions of samples, eliminating the small-sample limitations of traditional experiments.
Experiment cost: Computing resources bring the marginal cost per sample close to zero, enabling large-scale experiments.
Control of confounding factors: Precise version control ensures that all users within a group experience the same environment, preserving causal validity.
2. Why Conduct Experiments
2.1 Causal Inference
AB testing serves as the gold standard for causal inference, providing direct evidence of whether a variable causes an effect, unlike many theoretical causal methods that rely on untestable assumptions.
2.2 Value Evaluation
Beyond qualitative conclusions, AB tests quantify the impact of a change (e.g., revenue increase, user growth, efficiency gains), enabling data‑driven decision making.
2.3 Risk Control
Every optimization carries risk. Small‑traffic AB tests allow teams to estimate the risk of large‑scale rollouts, offering a safe trial‑and‑error mechanism.
3. Classification of Experiments
3.1 AB Test
3.1.1 AB Experiment
An AB experiment is a controlled randomized test that compares two versions of a single variable using a two‑sample hypothesis test. For example, a button’s color (A vs. B) is randomly shown to users, and click‑through rate (CTR) is measured.
Typical AB experiment workflow:
Define the objective (e.g., increase button clicks, metric = CTR).
Choose the randomization unit (user ID, page view, etc.).
Determine sample size via power analysis, balancing exposure proportion and experiment duration.
Collect data and perform hypothesis testing (usually a two‑sample t‑test or z‑test for large samples).
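The workflow above can be sketched in code. The following is a minimal, self-contained illustration (the functions and the 10%-to-11% CTR figures are hypothetical, not from the original article): an approximate per-group sample-size formula at 5% significance and 80% power, and a pooled two-proportion z-test on the collected click counts.

```python
import math

def norm_cdf(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_size_per_group(p_base, p_new, z_alpha=1.96, z_beta=0.8416):
    """Approximate users needed per arm to detect a CTR shift from
    p_base to p_new (two-sided 5% significance, 80% power)."""
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p_base - p_new) ** 2)

def two_prop_z_test(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, 2.0 * (1.0 - norm_cdf(abs(z)))

# Detecting a CTR lift from 10% to 11% needs roughly 15k users per arm.
n = sample_size_per_group(0.10, 0.11)
# Example read-out after collection: 1100/10000 clicks in B vs 1000/10000 in A.
z, p = two_prop_z_test(1100, 10000, 1000, 10000)
```

For small samples a two-sample t-test would replace the z-test; at online scale the normal approximation is standard.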
3.1.2 ABn Experiment
ABn experiments compare multiple versions of a single variable (e.g., several button colors). The hypothesis testing can involve:
Two‑sample tests for pairwise comparisons.
Multi‑sample tests (e.g., chi‑square or ANOVA) when more than two groups are involved.
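For the multi-sample case, a Pearson chi-square test on the click/non-click counts of all groups checks whether any version differs. A minimal sketch (click counts are invented for illustration):

```python
def chi_square_stat(clicks, impressions):
    """Pearson chi-square statistic for k groups of (clicks, non-clicks).
    Compare against the chi-square critical value with df = k - 1."""
    total_clicks = sum(clicks)
    total = sum(impressions)
    overall_ctr = total_clicks / total
    stat = 0.0
    for c, n in zip(clicks, impressions):
        exp_click = n * overall_ctr          # expected clicks under H0
        exp_nonclick = n * (1 - overall_ctr) # expected non-clicks under H0
        stat += (c - exp_click) ** 2 / exp_click
        stat += ((n - c) - exp_nonclick) ** 2 / exp_nonclick
    return stat

# Three button colors, 10k impressions each; df = 2, 5% critical value = 5.991.
stat = chi_square_stat([300, 320, 380], [10000, 10000, 10000])
```

If the statistic exceeds the critical value, pairwise tests (with multiple-comparison correction) identify which versions differ.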
3.1.3 AA Experiment
An AA experiment compares two identical versions to validate the experimental setup. At the 5% significance level, roughly 5% of AA tests will show a "significant" difference by chance alone (Type I error); a materially higher rate indicates flaws in randomization or measurement.
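This expected false-positive rate is easy to verify by simulation. The sketch below (parameters are illustrative) runs many AA comparisons where both arms draw from the same 10% CTR and counts how often a pooled z-test falsely flags significance:

```python
import math, random

def z_stat(c_a, n_a, c_b, n_b):
    """Pooled two-proportion z statistic."""
    p = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (c_a / n_a - c_b / n_b) / se

random.seed(42)
runs, n, ctr = 1000, 1000, 0.1
false_positives = 0
for _ in range(runs):
    # Both "arms" draw from the identical CTR: any significance is noise.
    c_a = sum(random.random() < ctr for _ in range(n))
    c_b = sum(random.random() < ctr for _ in range(n))
    if abs(z_stat(c_a, n, c_b, n)) > 1.96:
        false_positives += 1
rate = false_positives / runs  # lands near 0.05
```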
3.1.4 Multivariate Test (MVT)
MVT evaluates multiple variables simultaneously (e.g., button color and text). It can test interaction hypotheses such as whether color and text effects are independent.
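One simple way to test the independence (no-interaction) hypothesis in a 2x2 MVT is an interaction contrast: the color effect under one text minus the color effect under the other, divided by its standard error. A minimal sketch, with invented cell counts:

```python
import math

def interaction_z(cells):
    """cells: dict {(color, text): (clicks, impressions)} for a 2x2 MVT.
    Returns a z statistic for the no-interaction (additive effects) null."""
    p, var = {}, {}
    for key, (c, n) in cells.items():
        p[key] = c / n
        var[key] = p[key] * (1 - p[key]) / n
    # Interaction contrast: color effect under text B minus under text A.
    contrast = (p[("B", "B")] - p[("A", "B")]) - (p[("B", "A")] - p[("A", "A")])
    se = math.sqrt(sum(var.values()))
    return contrast / se

z = interaction_z({
    ("A", "A"): (500, 10000), ("B", "A"): (550, 10000),
    ("A", "B"): (520, 10000), ("B", "B"): (700, 10000),
})
```

|z| > 1.96 suggests the color and text effects are not additive, so the winning combination must be chosen jointly rather than variable by variable.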
3.2 Quasi‑Experiments (Class Experiments)
3.2.1 Definition
When randomization or parallel control groups are infeasible, quasi‑experiments (or “class experiments”) are used. They still rely on controlled comparisons but may lack full random assignment.
3.2.2 Characteristics
Key traits include non‑random grouping, large sample sizes, and often the use of internal or self‑controls.
3.2.3 Common Designs
Self‑pre/post control : Compare the same subjects before and after an intervention.
Pre/post with groups : Observe treatment and comparison groups (not necessarily randomly assigned) both before and after the intervention, then compare the changes (often using Difference-in-Differences).
Post‑only group control : Compare treated subjects with a contemporaneous control when pre‑intervention data are unavailable.
Solomon four‑group design : Combines pre/post and group controls, though rarely used due to complexity.
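The Difference-in-Differences estimate mentioned above is simple arithmetic: the change in the treated group minus the change in the comparison group, which nets out the shared time trend. A minimal sketch with made-up revenue figures:

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """DiD treatment-effect estimate: the treated group's change
    minus the control group's change over the same window."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Treated revenue rose 120 -> 150; control rose 100 -> 110 in the same period.
effect = diff_in_diff(120, 150, 100, 110)  # 30 - 10 = 20
```

The estimate is only valid under the parallel-trends assumption: absent treatment, both groups would have moved in step.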
4. Summary
Online experimentation has evolved into an independent discipline that bridges statistical theory and internet technology. AB testing exemplifies how large-scale data and computing power enable rigorous causal inference for product decisions. Future articles will delve into specific Alimama scenarios, covering standard AB tests and specialized experimental designs.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.