Five Optimization Strategies for Improving DataTester Query Performance
This article describes how DataTester, Volcano Engine's A/B testing platform, achieved over four‑fold query speedup by applying five technical optimizations—including pre‑aggregation, join reduction, GroupBy redesign, AU‑metric caching, and asynchronous query handling—targeting both data construction and execution layers.
DataTester is an A/B testing platform launched by Volcano Engine that supports recommendation, advertising, search, UI, and product‑function scenarios, offering a complete lifecycle from experiment design to final evaluation. After years of internal use at ByteDance, it has executed more than 1.5 million experiments.
The metric‑query module is a major performance bottleneck; the article shares five concrete optimization ideas that dramatically improve query speed while keeping resource usage low.
Challenge 1 – Version Management : The report page involves multiple systems (Console, scientific‑computing module, query engine, OLAP storage) and a long processing chain that converts DSL to SQL, caches results, performs calculations, and renders data on the front end. The query logic is complex, with custom SQL for each operator.
Example SQL used in a typical A/B test query:
printf("hello world!");SELECT event_date,
count(DISTINCT uc1) AS uv,
sum(value) AS sum_value,
sum(pow(value, 2)) AS sum_value_square
FROM (
SELECT uc1,
event_date,
count(s) AS value
FROM (
SELECT hash_uid AS uc1,
TIME,
server_time,
event,
event_date,
TIME AS s
FROM rangers.tob_apps_all et
WHERE tea_app_id = 249532
AND event = 'purchase'
AND event_date >= '2021-05-10' AND event_date <= '2021-05-19'
AND multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) >= 1620576000
AND multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) <= 1621439999
AND event IN ('rangers_push_send','rangers_push_workflow')
AND ifNull(string_params{'$inactive'},'null') != 'true'
) GLOBAL ANY
INNER JOIN (
SELECT min(multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME)) AS first_time,
hash_uid AS uc2
FROM rangers.tob_apps_all et
WHERE tea_app_id = 249532
AND arraySetCheck(ab_version, (29282))
AND event_date >= '2021-05-10' AND event_date <= '2021-05-19'
AND multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) >= 1620651351
AND multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) <= 1621439999
AND event IN ('rangers_push_send','rangers_push_workflow')
AND ifNull(string_params{'$inactive'},'null') != 'true'
GROUP BY uc2
) tab ON et.uc1 = tab.uc2
WHERE multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) >= first_time
AND first_time > 0
GROUP BY uc1, event_date
GROUP BY event_dateDataTester’s underlying OLAP engine is ClickHouse, which guides two main optimization directions: reducing ClickHouse joins and shrinking the right‑hand table size during joins.
Solution 1 – Pre‑aggregation : By compressing the amount of data scanned, daily pre‑computed aggregates are stored per user, drastically reducing query volume. The pipeline consists of Dump → Parse → Build → Query, with resource‑controlled Spark jobs (driver‑memory 4 GB, executor‑memory 2 GB, executor‑cores 2). This approach yields more than a 4× speedup on 400 M events and 1 M users.
Solution 2 – ab_log (join size reduction) : Real‑time exposure events are filtered to create a daily user‑join table containing only the earliest join time per user. Queries then take the minimum of the real‑time and offline join times, cutting event volume by over 30 % in some customers.
Solution 3 – GroupBy query optimization : Since every experiment version runs the same query with only a different version ID, the queries are refactored to share GroupBy logic, reducing the number of scans and improving throughput on a workload of 100 M daily events across three versions.
Solution 4 – AU‑metric caching : AU‑type operators previously issued two SQL statements per metric; caching the join‑user count eliminates redundant queries, speeding up report page loads.
Solution 5 – Asynchronous query : To avoid gateway timeouts, the front‑end and back‑end interaction was redesigned with async processing and Redis caching, bringing repeated page refresh latency down to ~100 ms.
Summary : The metric‑query component is a core competitive advantage of DataTester. The five techniques—pre‑aggregation, ab_log, GroupBy redesign, AU‑metric caching, and async queries—address data construction, concurrency reduction, and user‑experience improvements, delivering multi‑fold performance gains while keeping resource consumption low.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.