Vivo Hawking A/B Experiment Platform: Architecture, Practices, and Solutions
The Vivo Hawking platform provides a company‑wide, one‑stop A/B testing solution. Built on a layered experiment architecture, it offers covariate‑balanced split algorithms, real‑time split monitoring, and unified SDKs for Android, Java, and H5, supporting nearly a thousand experiments per day, automated analysis, and rapid product iteration across more than twenty departments.
1. Introduction
The article introduces the Vivo Hawking experiment platform, describing its system architecture, the problems encountered during business development, and the corresponding solutions.
2. Project Overview
Vivo’s internet products have shifted from a growth‑driven phase to a data‑driven, scientific development model. A/B testing has become a core tool for improving conversion efficiency and accelerating product‑research iteration. Hawking has evolved from a single system into a company‑wide, one‑stop platform that supports large‑scale A/B experiments.
2.1 A/B Experiment
An A/B experiment randomly splits traffic into two groups (A and B) to compare a new version of a page or feature with the old one. The article gives a concrete example where version A achieves a 70% conversion rate versus 50% for version B.
The experiment lifecycle is divided into three stages: pre‑experiment (define goals and metrics), during experiment (allocate traffic), and post‑experiment (evaluate results).
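The post‑experiment stage comes down to deciding whether an observed difference like 70% vs. 50% is statistically significant. A minimal sketch of that evaluation with a two‑proportion z‑test (the sample sizes of 1,000 users per group are assumed for illustration; the article does not state them):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for a conversion-rate difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 700/1000 vs 500/1000 conversions: the article's 70% vs 50% example
z, p = two_proportion_z(700, 1000, 500, 1000)
```

At these assumed sample sizes the p‑value is far below 0.05, so the platform would report the uplift of version A as significant.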
2.2 Layered Experiment Model
Hawking’s layered model follows Google’s “Overlapping Experiment Infrastructure” paper: traffic is re‑randomized independently in each layer, so every layer sees the full traffic and experiments running in different layers do not bias one another, while keeping consistent statistical guarantees.
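The per‑layer re‑randomization described in the paper is typically achieved by hashing the user ID together with a layer‑specific salt. A minimal sketch (function and salt names are illustrative, not Hawking’s actual implementation):

```python
import hashlib

def bucket(uid: str, layer_salt: str, n_buckets: int = 100) -> int:
    """Hash the uid together with a per-layer salt, so each layer
    re-randomizes traffic independently of every other layer."""
    digest = hashlib.md5(f"{layer_salt}:{uid}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# The same user falls into statistically independent buckets on different
# layers, so a UI-layer experiment does not bias a ranking-layer one.
ui_bucket = bucket("user-42", "ui-layer")
ranking_bucket = bucket("user-42", "ranking-layer")
```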
2.3 Platform Development and Business Value
Supports more than 900 daily experiments (peak >1000) across 20+ departments.
Standardized workflow lowers experiment entry barriers and improves efficiency.
Automated data‑analysis tools accelerate decision‑making and product iteration.
Reusable platform components avoid duplicated effort across teams.
3. Hawking System Architecture
The platform consists of several modules:
Experiment Personnel: roles for managing experiments, metrics, and analysis.
Experiment Portal: experiment management and result analysis.
Metric Management: built‑in and custom metrics integrated with the company’s big‑data metric system.
Comparison & Significance: visual components showing uplift, confidence intervals, and significance.
AA Analysis: validates that experiment groups are balanced on core metrics before the experiment starts.
Real‑time Split Monitoring: monitors traffic distribution and allows manual intervention.
Experiment Split Service: provides SDKs for Android, Java, and H5 (NGINX), exposed over Dubbo/HTTP, with a C++ SDK planned.
Split Methods: random split, targeted‑audience split, and covariate‑balanced split.
3.4–3.6 Data Services
Split data collection uses a unified data‑capture component and stores processed data in HDFS.
Metric calculation runs in an independent service with retry and alert mechanisms.
Data storage relies on MySQL for business data, Ehcache for configuration cache, Redis for auxiliary cache, and HDFS for experiment data.
4. Hawking Practices
4.1 Covariate Balancing Algorithm
Problem: Simple hash‑mod‑100 splitting can produce groups with uneven covariate distributions, harming statistical validity.
Solution: A three‑part covariate‑balanced algorithm consisting of offline stratified sampling, real‑time uniform grouping, and offline verification.
(1) Offline Stratified Sampling
Define core metrics with business owners.
Apply proportional stratification + K‑means clustering to obtain stratified samples.
Write sampled data into Hive tables.
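The proportional‑stratification step above can be sketched as follows, a simplified in‑memory version in which the K‑means cluster labels are assumed to have been computed already (function and parameter names are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(uids, stratum_of, sample_size, seed=7):
    """Draw a sample whose stratum proportions mirror the population's.
    stratum_of maps uid -> stratum label (e.g. a K-means cluster id)."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for uid in uids:
        by_stratum[stratum_of[uid]].append(uid)
    sample = []
    for members in by_stratum.values():
        # each stratum's quota is proportional to its population share
        quota = round(sample_size * len(members) / len(uids))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```

In production the input would come from user‑behavior tables and the output would be written back to Hive, as the article describes.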
(2) Real‑time Uniform Grouping
Synchronize stratified sample tables from Hive to Redis (uid → layer mapping, layer‑wise ratios).
Create experiments by linking experiment IDs, group IDs, and sample sizes to the latest layer data.
During split, look up a user’s layer and assign the user uniformly to a group within that layer.
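The lookup‑then‑assign step can be sketched with a least‑filled‑group rule, which keeps per‑layer group sizes uniform (a plain dict stands in for the Redis uid → layer mapping; class and method names are illustrative):

```python
from collections import defaultdict

class LayeredUniformSplitter:
    """Assign each user to the currently least-filled group within that
    user's stratification layer, so every layer is evenly represented in
    every experiment group."""

    def __init__(self, uid_to_layer, groups):
        self.uid_to_layer = uid_to_layer          # Redis hash in production
        self.groups = list(groups)
        self.counts = defaultdict(int)            # (layer, group) -> assigned

    def assign(self, uid):
        layer = self.uid_to_layer.get(uid, "default")
        group = min(self.groups, key=lambda g: self.counts[(layer, g)])
        self.counts[(layer, group)] += 1
        return group
```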
(3) High‑Performance Split Schemes
Three Redis‑based designs were evaluated:
Scheme 1 — HASH storing per‑bucket sample counts: fastest with only 2 buckets, but latency grows linearly with the number of buckets.
Scheme 2 — SORTED SET with bucket scores: stable performance regardless of bucket count.
Scheme 3 — HASH taking the layer sample count modulo the bucket size: stable, 1.12× faster than scheme 2, at 58% of single‑GET latency.
Scheme 3 was chosen for production.
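One reading of scheme 3 (this interpretation is mine, not spelled out in the article) is a single atomic per‑layer counter increment, with the group derived by taking the running count modulo the number of buckets, so one Redis HASH operation suffices per split:

```python
from collections import defaultdict

class ModuloSplitter:
    """Sketch of scheme 3: one atomic per-layer counter increment
    (an HINCRBY on a Redis HASH in production; a dict here), then
    group = running sample count modulo the number of groups."""

    def __init__(self, n_groups):
        self.n_groups = n_groups
        self.layer_counts = defaultdict(int)   # layer -> samples seen so far

    def assign(self, layer):
        count = self.layer_counts[layer]       # HINCRBY returns the new count
        self.layer_counts[layer] = count + 1
        return count % self.n_groups           # round-robin within the layer
```

Because the counter is atomic, concurrent splits within a layer always land in evenly rotating groups, which matches the scheme’s stable latency profile.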
Memory‑Efficient User‑Info Storage
Three storage designs were compared; the third — sharding user info across 10,000 primary hashes, each containing 125 secondary hashes — offered the best memory footprint and was selected.
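The point of sharding into many small hashes is that Redis stores small hashes in a compact encoding (ziplist/listpack), so memory drops sharply versus one giant hash. A sketch of the key routing (the key format and hash function are assumptions; the 10,000 × 125 shard counts come from the article):

```python
import hashlib

NUM_PRIMARY = 10000    # primary hashes (figure from the article)
NUM_SECONDARY = 125    # secondary hashes per primary (figure from the article)

def shard_key(uid: str) -> str:
    """Route a uid to one of 10000 * 125 small Redis hashes; each stays
    small enough for Redis's compact small-hash encoding."""
    h = int(hashlib.md5(uid.encode()).hexdigest(), 16)
    primary = h % NUM_PRIMARY
    secondary = (h // NUM_PRIMARY) % NUM_SECONDARY
    return f"user_info:{primary}:{secondary}"   # HSET <key> <uid> <info>
```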
4.2 Java SDK
The early Java SDK only provided split routing and required clients to report split results themselves, leading to high integration cost and performance bottlenecks (Dubbo thread‑pool exhaustion, network failures). Subsequent upgrades added built‑in split‑result reporting, real‑time configuration updates, self‑monitoring, and fallback mechanisms, substantially improving stability.
4.3 H5 Experiments
Traditional H5 A/B SDKs have several problems: they require client code changes, mask the page while deciding which variant to show, and impose long integration cycles. Hawking’s solution uses an APISIX‑based VUA (Unified Access) layer that automatically injects routing rules via a visual configuration platform, eliminating code changes on the client side.
Multi‑version and multi‑page H5 experiments are supported through APISIX plugins that rewrite upstream paths based on experiment configuration.
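The gateway‑level split can be pictured with APISIX’s stock `traffic-split` plugin, which weights requests between upstreams on a route. This is only an illustrative config fragment — the hostnames and weights are made up, and Hawking uses its own experiment‑aware plugin rather than this exact one:

```json
{
  "uri": "/h5/activity/*",
  "plugins": {
    "traffic-split": {
      "rules": [
        {
          "weighted_upstreams": [
            {
              "upstream": {
                "name": "variant-b",
                "type": "roundrobin",
                "nodes": { "h5-variant-b.internal:80": 1 }
              },
              "weight": 50
            },
            { "weight": 50 }
          ]
        }
      ]
    }
  },
  "upstream": {
    "type": "roundrobin",
    "nodes": { "h5-variant-a.internal:80": 1 }
  }
}
```

Half the traffic is routed to the variant‑B upstream and the rest falls through to the default (variant‑A) upstream, all without any change to the page’s own code.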
5. Experiment Effect Analysis
The platform provides metric services, near‑real‑time metric calculation, AA analysis, and visual dashboards for experiment evaluation.
6. Summary and Outlook
The Hawking platform enables a closed loop of experiment creation → data analysis → decision → iteration, offering:
Simple, flexible experiment workflow.
Scientific multi‑layer split algorithms without code releases.
Real‑time split monitoring and hourly metric dashboards.
Custom metric support without waiting for analyst‑built reports.
Future work will focus on improving user experience, simplifying metric configuration, and enhancing interactive data analysis (multi‑dimensional, attribution analysis).
vivo Internet Technology