
A/B Testing Platform Overview and Statistical Evaluation Methods

This article introduces the A/B testing platform used at AutoHome, detailing its architecture, experiment flow, traffic allocation strategies, and statistical evaluation techniques such as hypothesis testing, confidence intervals, and significance testing, to guide data‑driven decision making for recommendation system improvements.

HomeTech

Introduction

As AutoHome's business expands, diverse scenarios demand data-driven decisions. The recommendation system relies on refined strategy iteration, and A/B experiments provide an objective, data-based way to evaluate strategy effectiveness.

A/B Platform Overview

The platform supports various algorithm and engineering experiment scenarios, offering experiment management, traffic splitting, and effect evaluation. It enables goal setting, standardized experiment cycles, and automatic traffic scaling, and reduces low-quality or unused experiments.

Experiment Flow

The iterative process includes discovering optimization directions, formulating a hypothesis, running the experiment, validating the hypothesis, and deploying the result, applied primarily to algorithm and engineering changes.

Architecture

The platform consists of experiment management, traffic splitting mechanisms, and effect evaluation systems.

Experiment Splitting

Traffic is allocated using hash functions. Horizontal experiments hash on deviceid+layerid; vertical experiments may use a fixed deviceid, deviceid+day, or random bsdata. Including the layer ID in the hash key keeps bucket assignments independent across layers.
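A minimal sketch of this hashing scheme (the platform's actual hash function and bucket count are not specified in the article; MD5 and 100 buckets are assumptions for illustration):

```python
import hashlib

NUM_BUCKETS = 100  # assumed bucket count; the real platform's value is not given

def bucket(device_id: str, layer_id: str) -> int:
    """Map deviceid+layerid to a bucket. Because the layer ID is part of the
    hash key, different layers produce independent bucket assignments."""
    key = f"{device_id}:{layer_id}".encode()
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# The same device lands in unrelated buckets in different layers.
print(bucket("device-123", "layer-recall"))
print(bucket("device-123", "layer-rank"))
```

The assignment is deterministic, so a device always sees the same variant within a layer across sessions.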

Experiment Types

Horizontal experiments are mutually exclusive within a layer; vertical experiments are orthogonal across layers, allowing traffic reuse; conditional experiments require additional criteria (e.g., app version, region) to be met before traffic enters a layer.
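The orthogonality claim can be checked empirically: devices that fall into one bucket of an upper layer should spread evenly across a lower layer's buckets. A small simulation, using an assumed MD5-based bucketing function (not the platform's real one):

```python
import hashlib
from collections import Counter

def bucket(device_id: str, layer_id: str, n: int = 10) -> int:
    """Assumed hash-based bucketing: layer ID is part of the key."""
    key = f"{device_id}:{layer_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % n

# Take every device that falls into bucket 0 of layer A ...
layer_a_group = [f"dev-{i}" for i in range(100_000)
                 if bucket(f"dev-{i}", "layerA") == 0]

# ... and look at how that group distributes over layer B's buckets.
dist = Counter(bucket(d, "layerB") for d in layer_a_group)
print(dist)  # roughly uniform: layer B reuses layer A's traffic without bias
```

A near-uniform distribution is what makes vertical layering safe: an upper-layer experiment does not skew the composition of any lower-layer bucket.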

Traffic Allocation Improvements

Initially, each experiment occupied its own layer and could create buckets freely, which made management difficult. After optimization, each experiment is limited to one treatment bucket and one control bucket with equal traffic, multiple experiments can share a layer, and a "pure" bucket holds the unused traffic.
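A rough sketch of the optimized allocation rule (bucket names and percentages are illustrative, not the platform's actual configuration):

```python
# Hypothetical layer builder: each experiment gets exactly one treatment
# bucket and one control bucket of equal size; leftover traffic goes to
# a "pure" bucket that no experiment touches.
def build_layer(experiments: dict[str, int], total: int = 100) -> dict[str, int]:
    plan, used = {}, 0
    for name, pct in experiments.items():
        plan[f"{name}-treatment"] = pct
        plan[f"{name}-control"] = pct   # control always mirrors treatment traffic
        used += 2 * pct
    if used > total:
        raise ValueError("layer over-allocated")
    plan["pure"] = total - used         # unused traffic stays clean
    return plan

print(build_layer({"exp1": 10, "exp2": 5}))
# e.g. {'exp1-treatment': 10, 'exp1-control': 10,
#       'exp2-treatment': 5, 'exp2-control': 5, 'pure': 70}
```

The equal-traffic constraint keeps each comparison balanced, and the pure bucket doubles as a holdout for measuring the combined effect of all experiments.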

Scientific Effect Evaluation

Evaluation relies on statistical hypothesis testing. The null hypothesis H0 assumes no difference between A and B; the alternative hypothesis H1 asserts a difference. A Type I error (probability α) rejects a true H0, while a Type II error (probability β) fails to reject a false H0. Reducing these errors means raising the confidence level (1 − α) and the statistical power (1 − β), which in practice requires larger samples.
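To make the α/β trade-off concrete, the standard sample-size formula for a two-proportion test shows how tighter error bounds demand more traffic. This is a generic textbook calculation, not the platform's implementation; the CTR values are illustrative:

```python
from statistics import NormalDist

def sample_size(p1: float, p2: float, alpha: float = 0.05, beta: float = 0.2) -> int:
    """Required users per group to detect a shift from rate p1 to p2
    at significance level alpha with power 1 - beta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(1 - beta)         # e.g. 0.84 for power = 0.8
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return int(n) + 1

# Detecting a CTR lift from 10% to 11% needs far more users than 10% to 15%.
print(sample_size(0.10, 0.11))
print(sample_size(0.10, 0.15))
```

The quadratic dependence on the effect size is why small lifts require long-running, high-traffic experiments.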

The common tests include the Z-test, t-test, and chi-square test. The Z-test and t-test compare means: the Z-test applies when samples are large or the variance is known, the t-test when samples are small and the variance is unknown; the chi-square test compares categorical outcomes such as conversion counts. A p-value ≤ α (commonly 0.05) indicates statistical significance.
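A minimal two-proportion Z-test along these lines (the metric and counts are illustrative, assuming a click-through-rate comparison):

```python
from math import sqrt
from statistics import NormalDist

def z_test(clicks_a: int, users_a: int, clicks_b: int, users_b: int):
    """Two-sided two-proportion Z-test; returns (z statistic, p-value)."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)   # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided
    return z, p_value

z, p = z_test(1000, 10000, 1100, 10000)
print(f"z = {z:.3f}, p = {p:.4f}")  # p <= 0.05 -> reject H0
```

Pooling the rate under H0 is the standard choice for the test statistic; the unpooled variance is used instead when constructing the confidence interval for the difference.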

Confidence intervals are derived from the chosen confidence level (1 − α). For the difference of two independent sample means, the standard large-sample interval is (x̄_B − x̄_A) ± z_{α/2}·√(s_A²/n_A + s_B²/n_B). A 95% confidence interval that lies entirely above or below zero indicates a directional effect; an interval that crosses zero means the difference is not statistically significant.
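The interval and the crossing-zero check can be computed directly; the sample statistics below are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(mean_a: float, var_a: float, n_a: int,
            mean_b: float, var_b: float, n_b: int, conf: float = 0.95):
    """Large-sample CI for the difference of two independent sample means."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)   # 1.96 for 95%
    se = sqrt(var_a / n_a + var_b / n_b)
    diff = mean_b - mean_a
    return diff - z * se, diff + z * se

lo, hi = diff_ci(5.0, 4.0, 5000, 5.2, 4.1, 5000)
significant = lo > 0 or hi < 0                 # interval excludes zero?
print(f"[{lo:.3f}, {hi:.3f}] significant={significant}")
```

Reporting the interval alongside the p-value shows not just whether an effect exists but how large it plausibly is, which matters when deciding if a lift is worth shipping.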

Results and Outlook

A/B testing is widely used at AutoHome, with over 2,000 experiments run and 50+ configurable metrics. Future work includes enriching the custom metric library, adding effect-size dimensions to evaluation, and expanding scenario support.

Author Bio

Zhao Xinyuan joined AutoHome's Intelligent Recommendation Department in 2020 and works on recommendation engineering.

Tags: A/B testing, experiment platform, recommendation systems, statistical analysis, data-driven decisions
Written by HomeTech