How Volcano Engine DataTester Handles Private Deployment: Architecture, Challenges, and Business‑Driven Solutions
This article details Volcano Engine DataTester's private deployment architecture, the version‑management, performance, and stability challenges encountered, and the business‑oriented solutions—including branch strategies, pipeline automation, ClickHouse model optimizations, and multi‑level caching—that enable reliable, efficient A/B testing in on‑premise environments.
Private Deployment Architecture
The DataTester product, targeting the B2B market, adopts an Ansible + Bash build process to support small‑cluster private deployments. The system is divided into three logical parts:
Business Services : user‑facing functions such as experiment management, reporting, OpenAPI, and data ingestion.
Infrastructure Services : back‑end engines that compute reports and provide metadata, abstracting differences between SaaS (real‑time + offline Lambda architecture) and private deployments (real‑time only).
Infrastructure : a unified private‑cloud base called minibase that combines bare‑metal hosts with Kubernetes, exposing a consistent interface to upper‑layer services.
Challenge 1: Version Management
Unlike SaaS, which updates a single codebase weekly, private deployments require a baseline version and synchronized sub‑versions for each service to guarantee environment parity. Baseline releases occur bi‑monthly.
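One way to picture the baseline and sub-version relationship is a manifest that pins every service to a version, plus a check that a deployed environment matches those pins. The sketch below is illustrative only: the service names, version strings, and the find_drift helper are assumptions made for this example, not DataTester's actual release tooling.

```python
from typing import Dict, Optional, Tuple

# Hypothetical baseline manifest: one baseline release pins a version per service.
BASELINE_MANIFEST: Dict[str, Dict[str, str]] = {
    "baseline-1": {
        "experiment-manager": "1.4.2",
        "report-engine": "2.0.1",
        "traffic-splitter": "3.1.0",
    },
}


def find_drift(
    baseline: str, deployed: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return services whose deployed version differs from the baseline pin.

    Each entry maps service name -> (expected_version, deployed_version).
    A service missing from the deployment is reported with None.
    """
    expected = BASELINE_MANIFEST[baseline]
    drift: Dict[str, Tuple[str, Optional[str]]] = {}
    for service, version in expected.items():
        actual = deployed.get(service)
        if actual != version:
            drift[service] = (version, actual)
    return drift
```

An empty result means the environment matches the baseline; any entry identifies a service that would break parity between environments.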
To avoid concentrating effort during release windows, the team restructured branch logic and pipelines:
Branch Logic : Both SaaS and private builds originate from master. During a private release cycle, a dedicated private branch is created and merged back after release, ensuring master remains functional for both environments.
Release Pipeline : A pre‑release environment mirrors both SaaS and private clusters. Merge requests to master trigger automated regression tests in both environments, spreading testing effort across the feature development phase.
Challenge 2: Performance Optimization
DataTester’s reporting relies on ClickHouse for real‑time analysis. SaaS benefits from large, multi‑tenant clusters, while private deployments run on small, isolated clusters where experiment counts range from a few to hundreds, causing noticeable latency spikes.
Solution 1 – Experiment Report Framework : A report request specifies a date range, filter conditions, the selected metrics, the experiment and control versions, and a report type (e.g., multi‑day cumulative or single‑day trend). Each metric is defined from user events, event attributes, and built‑in operators, and these terms can be combined with arithmetic operators.
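The report inputs and metric definitions described here can be sketched as plain data structures. This is a minimal sketch under stated assumptions: the field names, the three built-in operators ("pv", "uv", "sum"), and the ratio helper are hypothetical choices for illustration and are not taken from DataTester.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class ReportRequest:
    """Inputs a report needs, as listed in the text (field names are assumed)."""
    start: date
    end: date
    filters: Dict[str, str]
    metrics: List[str]
    control_version: str
    experiment_versions: List[str]
    report_type: str  # e.g. "cumulative" or "daily_trend"


@dataclass(frozen=True)
class Term:
    """One metric term: an event, a built-in operator, and an optional attribute."""
    event: str
    operator: str        # assumed operators: "pv", "uv", "sum"
    attribute: str = ""  # only used by "sum"


def eval_term(term: Term, events: List[dict]) -> float:
    """Evaluate one term over a list of event dicts."""
    matched = [e for e in events if e["event"] == term.event]
    if term.operator == "pv":   # number of matching events
        return float(len(matched))
    if term.operator == "uv":   # number of distinct users
        return float(len({e["user_id"] for e in matched}))
    if term.operator == "sum":  # sum of one attribute
        return float(sum(e.get(term.attribute, 0) for e in matched))
    raise ValueError(f"unknown operator: {term.operator}")


def eval_ratio(numerator: Term, denominator: Term, events: List[dict]) -> float:
    """Combine two terms arithmetically; returns 0.0 when the denominator is 0."""
    denom = eval_term(denominator, events)
    return eval_term(numerator, events) / denom if denom else 0.0
```

For example, an "average purchase amount per paying user" metric would be eval_ratio(Term("purchase", "sum", "amount"), Term("purchase", "uv"), events).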
Solution 2 – Model Optimization : Exposure events were originally stored alongside regular events, which led to large tables, costly scans to find each user's first exposure record, and repeatedly reported exposures. The team moved exposure flags into a user‑level table, so exposure data no longer grows with time and fewer joins are needed. In tests, query time for 14‑day cumulative reports fell by more than 50%.
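The difference between the two models can be illustrated with in-memory stand-ins for the tables. This is a sketch, not the ClickHouse schema: the event fields, the (user, experiment) key, and both helper functions are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple

ExposureKey = Tuple[str, str]  # (user_id, experiment_id)
ExposureRow = Tuple[str, int]  # (version, first_exposure_timestamp)


def first_exposure_by_scan(events: Iterable[dict]) -> Dict[ExposureKey, ExposureRow]:
    """Original model: scan the whole event stream to find each first exposure."""
    first: Dict[ExposureKey, ExposureRow] = {}
    for e in events:
        if e["event"] != "exposure":
            continue
        key = (e["user_id"], e["experiment_id"])
        if key not in first or e["ts"] < first[key][1]:
            first[key] = (e["version"], e["ts"])
    return first


def record_exposure(
    user_table: Dict[ExposureKey, ExposureRow],
    user_id: str,
    experiment_id: str,
    version: str,
    ts: int,
) -> None:
    """User-level model: keep one row per (user, experiment).

    Repeated exposures do not add rows, so the table grows with the number
    of users and experiments rather than with time.
    """
    key = (user_id, experiment_id)
    if key not in user_table or ts < user_table[key][1]:
        user_table[key] = (version, ts)
```

With the user-level table maintained at write time, a report reads the exposure row directly instead of re-deriving it from the event stream on every query.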
Solution 3 – Pre‑aggregation : Resource estimation is based on daily active users and daily event volume. The day's raw events are scanned once to build a user_agg table that is 1/100 to 1/500 the size of the raw data. Most metric calculations then run against this compact table, which covers more than 80% of report metrics while still supporting user‑level filters and time‑range switches. The benefit grows with the number of experiments; with only a few experiments, the cost of building the table may outweigh the gains.
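The pre-aggregation idea can be sketched in a few lines: one pass over raw events produces a row per user and event, and metrics are then computed from those rows. Only the user_agg name comes from the text; the row layout (count and sum per user and event), the "value" field, and the helper functions are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

AggKey = Tuple[str, str]  # (user_id, event)


def build_user_agg(raw_events: Iterable[dict]) -> Dict[AggKey, dict]:
    """Scan the day's raw events once and keep one row per (user, event)."""
    agg: Dict[AggKey, dict] = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for e in raw_events:
        row = agg[(e["user_id"], e["event"])]
        row["count"] += 1
        row["sum"] += e.get("value", 0.0)
    return dict(agg)


def metric_from_agg(user_agg: Dict[AggKey, dict], event: str, users: Set[str]) -> dict:
    """Compute pv, uv, and sum for one event, restricted to a set of users.

    The user set stands in for a user-level filter, such as the users
    assigned to one experiment version.
    """
    pv, uv, total = 0, 0, 0.0
    for (user_id, ev), row in user_agg.items():
        if ev == event and user_id in users:
            pv += row["count"]
            uv += 1
            total += row["sum"]
    return {"pv": pv, "uv": uv, "sum": total}
```

Because the table is built once per day and reused by every experiment's report, the fixed cost of building it is spread over more queries as the number of experiments grows.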
Challenge 3: Stability
Private deployments involve more complex operations channels and stricter availability requirements, especially for the traffic‑splitting component, which determines which version a user sees.
The splitting service uses a three‑tier storage hierarchy: in‑process memory, a Redis cache, and a relational database (MySQL). Configuration changes are written to a message queue; each service node consumes the queue to update its memory and Redis, which keeps nodes consistent. An auxiliary goroutine periodically performs a full refresh as a fallback. Redis also acts as a hot standby for MySQL, so the service can recover after a restart without losing the latest split configuration.
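The lookup and update paths of such a three-tier store can be sketched as follows. This is an illustration under several assumptions: plain dicts stand in for Redis and MySQL, a Python thread stands in for the goroutine, the full refresh is assumed to reload from the database, and the class and method names are hypothetical.

```python
import threading
from typing import Dict, Optional


class SplitConfigStore:
    """Three tiers: in-process memory, then Redis, then the database."""

    def __init__(self, redis: Dict[str, str], database: Dict[str, str]) -> None:
        self._memory: Dict[str, str] = {}
        self._redis = redis   # stand-in for a Redis client
        self._db = database   # stand-in for a MySQL table
        self._lock = threading.Lock()

    def get(self, experiment_id: str) -> Optional[str]:
        """Read through the tiers, filling the faster ones on a miss."""
        with self._lock:
            if experiment_id in self._memory:
                return self._memory[experiment_id]
        value = self._redis.get(experiment_id)
        if value is None:
            value = self._db.get(experiment_id)
            if value is not None:
                self._redis[experiment_id] = value
        if value is not None:
            with self._lock:
                self._memory[experiment_id] = value
        return value

    def on_config_message(self, experiment_id: str, config: str) -> None:
        """Apply one change consumed from the message queue."""
        self._redis[experiment_id] = config
        with self._lock:
            self._memory[experiment_id] = config

    def full_refresh(self) -> None:
        """Fallback path: reload everything from the database (assumed source)."""
        snapshot = dict(self._db)
        self._redis.update(snapshot)
        with self._lock:
            self._memory = snapshot

    def start_periodic_refresh(self, interval_seconds: float) -> threading.Event:
        """Run full_refresh on a background thread until the returned event is set."""
        stop = threading.Event()

        def loop() -> None:
            # Event.wait returns False on timeout, so refresh runs once per interval.
            while not stop.wait(interval_seconds):
                self.full_refresh()

        threading.Thread(target=loop, daemon=True).start()
        return stop
```

A freshly started instance has empty memory, so its first read falls through to Redis. That is the hot-standby behavior described above: the latest configuration is still served even if the database is slow or unreachable at startup.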
Conclusion
Volcano Engine DataTester originated from ByteDance's internal tooling and builds on extensive B2B experimentation experience. The private‑deployment journey, from version control and performance tuning to stability engineering, illustrates how a SaaS‑born product can mature into a robust on‑premise solution that delivers consistent A/B testing value for external customers.
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
