Big Data 12 min read

Five Optimization Strategies for Improving DataTester Query Performance

This article describes how DataTester, Volcano Engine's A/B testing platform, achieved over four‑fold query speedup by applying five technical optimizations—including pre‑aggregation, join reduction, GroupBy redesign, AU‑metric caching, and asynchronous query handling—targeting both data construction and execution layers.

DataFunTalk
DataFunTalk
DataFunTalk
Five Optimization Strategies for Improving DataTester Query Performance

DataTester is an A/B testing platform launched by Volcano Engine that supports recommendation, advertising, search, UI, and product‑function scenarios, offering a complete lifecycle from experiment design to final evaluation. After years of internal use at ByteDance, it has executed more than 1.5 million experiments.

The metric‑query module is a major performance bottleneck; the article shares five concrete optimization ideas that dramatically improve query speed while keeping resource usage low.

Challenge 1 – Version Management : The report page involves multiple systems (Console, scientific‑computing module, query engine, OLAP storage) and a long processing chain that converts DSL to SQL, caches results, performs calculations, and renders data on the front end. The query logic is complex, with custom SQL for each operator.

Example SQL used in a typical A/B test query:

printf("hello world!");SELECT event_date,
    count(DISTINCT uc1) AS uv,
    sum(value) AS sum_value,
    sum(pow(value, 2)) AS sum_value_square
FROM (
    SELECT uc1,
        event_date,
        count(s) AS value
    FROM (
        SELECT hash_uid AS uc1,
            TIME,
            server_time,
            event,
            event_date,
            TIME AS s
        FROM rangers.tob_apps_all et
        WHERE tea_app_id = 249532
            AND event = 'purchase'
            AND event_date >= '2021-05-10' AND event_date <= '2021-05-19'
            AND multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) >= 1620576000
            AND multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) <= 1621439999
            AND event IN ('rangers_push_send','rangers_push_workflow')
            AND ifNull(string_params{'$inactive'},'null') != 'true'
    ) GLOBAL ANY
    INNER JOIN (
        SELECT min(multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME)) AS first_time,
            hash_uid AS uc2
        FROM rangers.tob_apps_all et
        WHERE tea_app_id = 249532
            AND arraySetCheck(ab_version, (29282))
            AND event_date >= '2021-05-10' AND event_date <= '2021-05-19'
            AND multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) >= 1620651351
            AND multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) <= 1621439999
            AND event IN ('rangers_push_send','rangers_push_workflow')
            AND ifNull(string_params{'$inactive'},'null') != 'true'
        GROUP BY uc2
    ) tab ON et.uc1 = tab.uc2
WHERE multiIf(server_time < 1609948800, server_time, TIME > 2000000000, toUInt32(TIME/1000), TIME) >= first_time
    AND first_time > 0
GROUP BY uc1, event_date
GROUP BY event_date

DataTester’s underlying OLAP engine is ClickHouse, which guides two main optimization directions: reducing ClickHouse joins and shrinking the right‑hand table size during joins.

Solution 1 – Pre‑aggregation : By compressing the amount of data scanned, daily pre‑computed aggregates are stored per user, drastically reducing query volume. The pipeline consists of Dump → Parse → Build → Query, with resource‑controlled Spark jobs (driver‑memory 4 GB, executor‑memory 2 GB, executor‑cores 2). This approach yields more than a 4× speedup on 400 M events and 1 M users.

Solution 2 – ab_log (join size reduction) : Real‑time exposure events are filtered to create a daily user‑join table containing only the earliest join time per user. Queries then take the minimum of the real‑time and offline join times, cutting event volume by over 30 % in some customers.

Solution 3 – GroupBy query optimization : Since every experiment version runs the same query with only a different version ID, the queries are refactored to share GroupBy logic, reducing the number of scans and improving throughput on a workload of 100 M daily events across three versions.

Solution 4 – AU‑metric caching : AU‑type operators previously issued two SQL statements per metric; caching the join‑user count eliminates redundant queries, speeding up report page loads.

Solution 5 – Asynchronous query : To avoid gateway timeouts, the front‑end and back‑end interaction was redesigned with async processing and Redis caching, bringing repeated page refresh latency down to ~100 ms.

Summary : The metric‑query component is a core competitive advantage of DataTester. The five techniques—pre‑aggregation, ab_log, GroupBy redesign, AU‑metric caching, and async queries—address data construction, concurrency reduction, and user‑experience improvements, delivering multi‑fold performance gains while keeping resource consumption low.

performanceBig DataQuery OptimizationClickHouseA/B testingDataTester
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.