Optimization of A/B Test Metric Computation Using Spark and ClickHouse
This article details the design and multi‑stage optimization of an A/B testing metric system. It describes the product architecture, the Spark‑based computation engine, the ClickHouse OLAP layer, the cumulative calculation improvements, and the batch processing techniques that kept processing time at 2–3 hours while the workload grew from 10+ experiments and 50+ metrics to 150+ experiments and 600+ metrics.
Introduction
A/B testing is a data‑driven method that splits traffic to run multiple product versions simultaneously, records user behavior, and compares metrics to support scientific product decisions.
Metric Product Design
The metric system uses a registration approach where users define metrics with SQL formulas and optional custom dimensions; the analysis layer provides both pre‑computed and on‑demand multi‑dimensional queries.
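A minimal sketch of what such a registration approach could look like. The class name `MetricDefinition`, the table `experiment_events`, and the columns `experiment_id`, `variant`, and `grade` are hypothetical and not taken from the platform; the sketch only shows how a SQL formula plus optional dimensions can be turned into a grouped query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """A registered metric: a name, a SQL aggregate formula, optional dimensions."""
    name: str
    formula: str            # SQL aggregate expression supplied by the user
    dimensions: tuple = ()  # optional custom dimensions to group by


REGISTRY = {}


def register(metric):
    """Add a metric to the registry, rejecting duplicate names."""
    if metric.name in REGISTRY:
        raise ValueError(f"metric already registered: {metric.name}")
    REGISTRY[metric.name] = metric


def build_query(metric_name, table="experiment_events"):
    """Render the query for one metric (table and column names are assumptions)."""
    m = REGISTRY[metric_name]
    group_cols = ("experiment_id", "variant") + tuple(m.dimensions)
    cols = ", ".join(group_cols)
    return f"SELECT {cols}, {m.formula} AS {m.name} FROM {table} GROUP BY {cols}"


register(MetricDefinition("pay_rate", "SUM(paid) / COUNT(DISTINCT user_id)", ("grade",)))
print(build_query("pay_rate"))
```

Keeping the formula as data rather than code is what makes both pre‑computed and on‑demand queries possible from the same definition.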
Metric Technical Architecture
The platform employs Spark as the core computation engine for its performance and maturity, and ClickHouse as the OLAP engine for fast multi‑dimensional analysis of detailed data.
Initially, processing 10+ experiments and 50+ metrics took 2–3 hours. After six months the workload had grown to 10 experiments running in parallel on 100 cores, which prompted optimization.
Stage 1: Engine and Architecture Optimization
The team adopted Spark for batch jobs and ClickHouse for analytical queries. Multiple experiments run in parallel, while the metric calculations within each experiment remain serial.
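The scheduling pattern can be sketched as follows, assuming a thread pool stands in for concurrent Spark job submission. `compute_metric` is a hypothetical placeholder, and the default of 10 parallel experiments mirrors the figure mentioned earlier rather than a confirmed setting.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_metric(experiment_id, metric):
    """Hypothetical placeholder for submitting one metric job to the engine."""
    return f"{experiment_id}:{metric}"


def run_experiment(experiment_id, metrics):
    """Metrics inside one experiment are computed one after another."""
    return [compute_metric(experiment_id, m) for m in metrics]


def run_all(experiments, max_parallel=10):
    """Experiments are independent, so they are submitted to a pool in parallel."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {
            exp_id: pool.submit(run_experiment, exp_id, metrics)
            for exp_id, metrics in experiments.items()
        }
        return {exp_id: future.result() for exp_id, future in futures.items()}


results = run_all({"exp_a": ["ctr", "pay_rate"], "exp_b": ["ctr"]})
print(results)
```

With this layout, total runtime is bounded by the slowest experiment rather than by the sum of all experiments, but a single experiment with many metrics can still dominate.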
Stage 2: Cumulative Calculation Model Optimization
The original model scanned all historical data for each cumulative metric. The new model builds daily aggregates incrementally, which improves both performance and accuracy.
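The difference between the two models can be illustrated with a small sketch, assuming an additive metric such as a sum or an event count. The function names and sample data are hypothetical; non‑additive aggregates such as distinct‑user counts would need extra state and are not covered here.

```python
def cumulative_full_scan(daily_values, up_to_day):
    """Old model: rescan every historical day each time a cumulative value is needed."""
    return sum(
        value
        for day, values in daily_values.items()
        if day <= up_to_day  # ISO date strings compare chronologically
        for value in values
    )


def cumulative_incremental(daily_values):
    """New model: aggregate each day once and add it to the previous running total."""
    running_total = 0
    totals = {}
    for day in sorted(daily_values):
        running_total += sum(daily_values[day])
        totals[day] = running_total
    return totals


daily_values = {
    "2024-01-01": [3, 2],
    "2024-01-02": [4],
    "2024-01-03": [1, 1, 1],
}
totals = cumulative_incremental(daily_values)
print(totals)
```

The full scan reads every historical row again on each new day, while the incremental version reads only the new day's rows and reuses the previous total.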
Stage 3: Rate Metric Batch Optimization
The team implemented batch processing for rate metrics that share the same SQL definition across experiments, so that one job computes the metric for many experiments at once. This reduced total runtime to about 5 hours for 150+ experiments and 600+ metrics.
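A minimal sketch of the batching idea, using an in‑memory SQLite database as a stand‑in for the real engine. The `events` table, its columns, and the `conversion_rate` formula are assumptions for illustration. Requests that share a formula are grouped, and each group is answered by a single query over all of its experiments.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (experiment_id TEXT, variant TEXT, user_id INTEGER, converted INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("exp_a", "control", 1, 0), ("exp_a", "control", 2, 1),
        ("exp_a", "treatment", 3, 1), ("exp_a", "treatment", 4, 1),
        ("exp_b", "control", 5, 0), ("exp_b", "treatment", 6, 1),
    ],
)

# Each request is (experiment_id, metric_name, SQL formula).
RATE = "1.0 * SUM(converted) / COUNT(*)"
requests = [("exp_a", "conversion_rate", RATE), ("exp_b", "conversion_rate", RATE)]


def run_batched(conn, requests):
    """Group requests by identical metric definition, then run one query per group."""
    by_definition = defaultdict(list)
    for experiment_id, name, formula in requests:
        by_definition[(name, formula)].append(experiment_id)

    results = {}
    for (name, formula), experiment_ids in by_definition.items():
        placeholders = ", ".join("?" for _ in experiment_ids)
        sql = (
            f"SELECT experiment_id, variant, {formula} FROM events "
            f"WHERE experiment_id IN ({placeholders}) "
            "GROUP BY experiment_id, variant"
        )
        for experiment_id, variant, value in conn.execute(sql, experiment_ids):
            results[(experiment_id, variant, name)] = value
    return results


batched = run_batched(conn, requests)
print(batched)
```

Here two experiments are served by one scan of the table instead of two; the saving grows with the number of experiments that share a definition.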
Stage 4: Mean Metric Batch Optimization
For complex mean metrics, the team introduced Spark checkpointing and a hybrid Spark‑ClickHouse workflow that caches intermediate detail data. This achieved further speed‑ups at the cost of increased complexity.
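The caching idea can be sketched in plain Python, with a dictionary standing in for the materialized intermediate data. This is an analogy under stated assumptions, not the platform's Spark or ClickHouse code: the event layout, column names, and function names are hypothetical. The expensive per‑user detail is built once, and several mean metrics are then derived from it without touching the raw events again.

```python
from collections import defaultdict
from statistics import mean

# Raw events: (variant, user_id, watch_minutes, orders). Sample data is made up.
events = [
    ("control", 1, 10, 0), ("control", 1, 20, 1), ("control", 2, 10, 0),
    ("treatment", 3, 40, 1), ("treatment", 4, 20, 1), ("treatment", 4, 0, 1),
]


def build_user_detail(events):
    """Expensive step, done once: collapse raw events to one row per (variant, user)."""
    detail = defaultdict(lambda: {"watch_minutes": 0, "orders": 0})
    for variant, user_id, watch_minutes, orders in events:
        row = detail[(variant, user_id)]
        row["watch_minutes"] += watch_minutes
        row["orders"] += orders
    return dict(detail)


def mean_metric(user_detail, column):
    """Cheap step, repeated per metric: average the per-user values by variant."""
    per_variant = defaultdict(list)
    for (variant, _user_id), row in user_detail.items():
        per_variant[variant].append(row[column])
    return {variant: mean(values) for variant, values in per_variant.items()}


user_detail = build_user_detail(events)  # materialized once, reused below
print(mean_metric(user_detail, "watch_minutes"))
print(mean_metric(user_detail, "orders"))
```

The trade‑off matches the one described above: reusing the intermediate data avoids recomputing it for every metric, but the pipeline now has an extra materialized stage to manage.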
Conclusion
After the four‑stage optimization, the system now handles over 150 experiments and 600 metrics with a stable processing time of 2–3 hours, demonstrating scalable and controllable performance as the workload grows.
TAL Education Technology
TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.