Ensuring Trustworthy A/B Experiments: Architecture, Balance Checks, Log Consistency, Automated Significance Testing, and Result Interpretation
This article discusses how to improve the reliability of online A/B experiments by designing robust architecture, evaluating group balance with orthogonal testing, ensuring consistent front‑end/back‑end logging, automating statistical significance checks, reducing group imbalance, and interpreting results using causal trees.
Background: A/B experiments are essential for internet companies, and their credibility directly affects operating costs. Dada prioritized trustworthiness early in building its experimentation system, integrating statistical theory and machine learning into the platform.
What is A/B testing: a randomized controlled experiment that splits users into mutually exclusive groups, applies a single change (the treatment) to one group, and thereby supports accurate causal inference about that change's effect.
We focus on three reliability aspects: (1) traffic grouping – ensuring balanced random splits for comparability; (2) data collection – avoiding loss or inconsistency of group and business data; (3) data analysis – preventing unreliable conclusions by performing significance and heterogeneity analyses.
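The traffic-grouping step above is typically implemented as deterministic hash-based bucketing: hashing the user ID with an experiment-specific salt gives a stable, reproducible split. A minimal sketch, assuming the MD5 hash the article later validates (function names and the 100-bucket layout are illustrative, not Dada's actual API):

```python
import hashlib

def assign_bucket(user_id: str, experiment_id: str, n_buckets: int = 100) -> int:
    """Deterministically map a user to a bucket by hashing the user ID
    together with an experiment-specific salt (MD5, as in the article)."""
    key = f"{experiment_id}:{user_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets

def variant(user_id: str, experiment_id: str) -> str:
    """Buckets 0-49 -> control, 50-99 -> treatment: a 50/50 split."""
    return "treatment" if assign_bucket(user_id, experiment_id) >= 50 else "control"
```

Because the assignment is a pure function of (user ID, experiment salt), the same user always lands in the same group, and different experiments get independent splits by varying the salt.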
Architecture: Dada has built a comprehensive A/B experimentation platform covering the three stages and integrating with other marketing systems.
Problem 1 – Checking group algorithm balance: We assess orthogonality of group assignments across multiple experiments using chi‑square tests on millions of user IDs hashed with MD5, SHA256, CityHash64, and SpookyHash, finding no significant imbalance and confirming MD5’s adequacy.
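The orthogonality check can be reproduced as a chi-square test of independence on a 2x2 contingency table of (experiment 1 group, experiment 2 group) assignments; a statistic below the df=1 critical value of 3.841 (alpha = 0.05) is consistent with independent splits. A minimal stdlib-only sketch (the two-group layout and salt scheme are illustrative):

```python
import hashlib

def bucket(user_id: str, salt: str) -> int:
    """Hash the user ID with an experiment-specific salt into one of two groups."""
    return int(hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest(), 16) % 2

def chi2_independence(pairs):
    """Pearson chi-square statistic for a 2x2 contingency table of
    paired group assignments from two experiments."""
    obs = [[0, 0], [0, 0]]
    for a, b in pairs:
        obs[a][b] += 1
    n = sum(map(sum, obs))
    row = [sum(r) for r in obs]
    col = [sum(c) for c in zip(*obs)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (obs[i][j] - expected) ** 2 / expected
    return chi2

pairs = [(bucket(f"u{i}", "exp1"), bucket(f"u{i}", "exp2")) for i in range(20000)]
stat = chi2_independence(pairs)  # compare against 3.841 (chi-square, df=1)
```

The article runs the same style of check over millions of IDs and across several hash functions (MD5, SHA256, CityHash64, SpookyHash) before settling on MD5.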
Problem 2 – Front‑end/back‑end log inconsistency: Network failures cause mismatched group assignments; we mitigated this by moving the group‑assignment request to app cold‑start, adding retries and caching, which reduced inconsistency from >10% to ~0.2%.
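The mitigation in Problem 2 amounts to fetching the assignment once at cold start, retrying with backoff on failure, and falling back to a cached value so front-end and back-end logs stay consistent. A hypothetical sketch (the `fetch` callable stands in for the real RPC; names are illustrative):

```python
import time

def fetch_assignment_with_retry(fetch, cache: dict, user_id: str,
                                retries: int = 3, backoff: float = 0.1):
    """Fetch the group assignment at app cold start, retrying on network
    failure and falling back to the last cached assignment."""
    for attempt in range(retries):
        try:
            group = fetch(user_id)
            cache[user_id] = group  # cache so later sessions report the same group
            return group
        except ConnectionError:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    return cache.get(user_id)  # network still down: reuse the cached assignment
```

Retries shrink the window in which the client has no assignment, and the cache guarantees that a transient failure cannot flip a user's reported group mid-session.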
Problem 3 – Automated significance testing: We parse SQL into abstract syntax trees (using Druid’s parser) to auto‑generate metrics, translate SQL to Elasticsearch DSL when needed, and apply appropriate statistical tests (Welch’s t‑test, large‑sample proportion test) for each metric.
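The two tests named above are standard and easy to state directly: Welch's t-test for continuous metrics with unequal variances, and the pooled two-proportion z-test for conversion-style metrics. A minimal stdlib sketch of both statistics (not Dada's actual implementation):

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    se2 = va / na + vb / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Large-sample z statistic comparing two conversion rates,
    pooling the rate under the null hypothesis."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

With the metric definitions extracted automatically from the SQL AST, each metric can be routed to the appropriate test by type (mean-valued vs. ratio-valued).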
Problem 4 – Reducing group imbalance: For small‑sample or high‑value user groups, we run A/A experiments, exclude head‑users, or use low‑variance transformed metrics to lower variance and achieve more balanced splits.
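One of the transformations above, capping extreme "head user" values at a high percentile (winsorization), can be sketched in a few lines; this is one illustrative option, not necessarily the exact transform Dada uses:

```python
def winsorize(values, upper_pct=0.99):
    """Cap values above the given percentile to shrink the influence of
    head users on the metric's variance."""
    ordered = sorted(values)
    cap = ordered[min(int(upper_pct * len(ordered)), len(ordered) - 1)]
    return [min(v, cap) for v in values]
```

Lower metric variance means smaller standard errors, so small-sample experiments reach significance (or reveal true nulls) with less traffic.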
Problem 5 – Interpreting experiment results: We employ causal trees (a variant of decision trees) to identify sub‑populations with heterogeneous treatment effects, using AUUC to balance performance and interpretability.
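The AUUC metric mentioned above can be computed from ranked users directly: sort by predicted uplift, accumulate the normalized difference between treatment and control responders, and take the average. A simplified sketch of one common AUUC definition (the literature has several variants; this is illustrative, not Dada's exact formula):

```python
def auuc(scores, treated, outcomes):
    """Area Under the Uplift Curve: rank users by predicted uplift score,
    then accumulate the normalized treatment-minus-control response gap."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_t = sum(treated)
    n_c = len(treated) - n_t
    cum_t = cum_c = 0
    area = 0.0
    for i in order:
        if treated[i]:
            cum_t += outcomes[i]
        else:
            cum_c += outcomes[i]
        area += cum_t / n_t - cum_c / n_c  # cumulative normalized uplift
    return area / len(scores)
```

A model that ranks truly responsive users first scores higher than one that ranks them last, which is what lets AUUC trade off predictive performance against the interpretability of a shallow causal tree.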
References: Ron Kohavi and Diane Tang’s work on reliable online experiments, Chen Xiru’s probability and statistics textbook, plus additional papers on online experimentation and heterogeneous treatment effect estimation.
Dada Group Technology
Sharing insights and experiences from Dada Group's R&D department on product refinement and technology advancement, connecting with fellow geeks to exchange ideas and grow together.