Hera Data Service Architecture and Core Functions at Vipshop
The article introduces Vipshop's self‑developed Hera data service, detailing its background, three‑tier architecture, scheduling workflow, multi‑engine query capabilities, adaptive execution, SQL generation, resource isolation, metrics collection, and performance improvements such as Alluxio caching for large‑scale crowd computing.
Data services are a key component of a data middle platform: they act as a unified entry point for data warehouse access and provide standardized APIs for data inflow and outflow, meeting diverse data access needs.
Vipshop began building its own data service, Hera, in 2019. Since then, Hera has grown from scratch to serve more than 30 business units with both B2B and B2C data services.
Background
Before a unified data service existed, the data warehouse suffered from low efficiency, inconsistent metrics, and a duplicated interface for each engine (Presto, ClickHouse, etc.). These issues caused long extraction times for large advertising audiences, a proliferation of APIs, and inconsistent metric definitions.
The data service was created to address these problems by abstracting away storage and compute engines and offering a single API, layered storage, adaptive SQL generation, unified caching, and SLA‑driven performance.
Architecture Design
Hera follows a classic master/slave model with separate data and control flows for high availability. It consists of three layers:
Application Access Layer – provides TCP client, HTTP, and internal RPC (OSP) interfaces for business requests.
Data Service Layer – handles routing, multi‑engine support, resource configuration, dynamic engine‑parameter assembly, adaptive execution, a unified query cache, and FreeMarker‑based SQL generation.
Data Layer – transparently accesses data stored in the warehouse, ClickHouse, MySQL, Redis, etc., through a single API.
The scheduling system consists of Master, Worker, Client, ConfigCenter, and TransferServer modules, which together manage job distribution, ETL execution, and file transfer.
Main Functions
Multi‑queue scheduling: different users and task types are assigned weighted queues to meet SLA requirements.
Multi‑engine query: supports Spark, Presto, ClickHouse, Hive, MySQL, Redis, selecting the optimal engine per scenario.
Multiple task types: ETL, ad hoc query, file export, and data import, enabling combinations such as Spark‑adhoc and Presto‑adhoc.
File export: large‑scale data export to HDFS/Alluxio with TCP download, reducing export time from 30+ minutes to under 3 minutes for massive audience datasets.
Resource isolation: separates worker and engine resources for core vs. non‑core workloads.
Dynamic engine‑parameter assembly: automatically builds and adjusts engine parameters per task, user, and business type.
Adaptive engine execution: if the chosen engine fails, the system switches to another engine to ensure query success.
SQL construction: supports single‑table, star‑schema, and snowflake models for dimensional modeling.
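The adaptive engine execution described above can be sketched as a simple fallback loop: try the preferred engine first and move on to the next rewrite when it fails. This is a minimal illustration, not Hera's actual code; the engine names and the `run_query` callable are assumptions.

```python
# Hypothetical sketch of adaptive engine execution: try engines in
# preference order, falling back to the next one on any failure.
from typing import Callable, Dict, List


def execute_adaptively(
    sql_by_engine: Dict[str, str],      # engine name -> rewritten SQL
    engine_order: List[str],            # preference order, e.g. ["presto", "spark"]
    run_query: Callable[[str, str], list],
) -> list:
    """Run the query on the first engine that succeeds."""
    last_error = None
    for engine in engine_order:
        sql = sql_by_engine.get(engine)
        if sql is None:
            continue  # no rewrite exists for this engine
        try:
            return run_query(engine, sql)
        except Exception as exc:  # engine down, timeout, dialect mismatch...
            last_error = exc      # remember the cause, try the next engine
    raise RuntimeError(f"all engines failed: {last_error}")
```

In practice the preference order would itself come from the routing layer, based on the scenario (latency‑sensitive ad hoc queries vs. heavy ETL).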
Task Scheduling
Uses Netty for cluster messaging, separating network I/O from business logic via distinct thread pools. Multi‑queue and multi‑user scheduling consider queue weight, dynamic factors (queue size/capacity, running tasks), and job weight (based on timeout) to compute a final score for dispatch.
Score = job weight + queue dynamic factor + queue weight.
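The score formula above can be sketched as follows. The article only names the three components, so the concrete weighting below (timeout‑derived job weight, a dynamic factor from queue fill level and running tasks) is an illustrative assumption.

```python
# Illustrative dispatch score: score = job weight + queue dynamic factor
# + queue weight. The exact weighting functions are assumptions.

def dispatch_score(job_timeout_s: float, queue_size: int,
                   queue_capacity: int, running_tasks: int,
                   queue_weight: float) -> float:
    # Tighter timeout (stricter SLA) -> larger job weight.
    job_weight = 1.0 / max(job_timeout_s, 1.0)
    # Emptier queue with fewer running tasks -> larger dynamic factor.
    dynamic_factor = (1.0 - queue_size / queue_capacity) - 0.01 * running_tasks
    return job_weight + dynamic_factor + queue_weight
```

A job with a tighter SLA in an idle queue thus outranks a loose‑SLA job in a busy one, which is the dispatch behavior the formula is meant to produce.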
SQL Job Flow
Clients submit raw SQL (e.g., Presto). The SQLParser rewrites it for target engines (Spark, Presto, ClickHouse) and returns all possible rewrites. The Master schedules the job to Workers, which execute on the chosen engine; if execution fails, another engine is tried. Results are sent directly to the client, and the job lifecycle progresses through NEW → WAITING → RUNNING → SUCCESS/FAIL.
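The lifecycle NEW → WAITING → RUNNING → SUCCESS/FAIL is a small state machine; a toy version might look like the sketch below. The transition table is inferred from the arrow chain in the text, not taken from Hera's source.

```python
# Toy state machine for the job lifecycle NEW -> WAITING -> RUNNING
# -> SUCCESS/FAIL. Transition rules are assumptions from the article.

VALID_TRANSITIONS = {
    "NEW": {"WAITING"},
    "WAITING": {"RUNNING"},
    "RUNNING": {"SUCCESS", "FAIL"},
    "SUCCESS": set(),   # terminal
    "FAIL": set(),      # terminal
}


class Job:
    def __init__(self) -> None:
        self.state = "NEW"

    def advance(self, new_state: str) -> None:
        if new_state not in VALID_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```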
Metrics Collection
Hera collects static metrics (master/worker/client info) and dynamic metrics (runtime memory usage, task queue snapshots). Workers report heartbeats with memory usage, enabling the Master to make informed scheduling decisions.
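A worker heartbeat combining static identity with dynamic runtime metrics might carry a payload like the one below. The field names are illustrative assumptions, not Hera's actual wire format.

```python
# Illustrative heartbeat payload: static worker identity plus dynamic
# memory usage and a task-queue snapshot. Field names are assumptions.
import json
import time


def build_heartbeat(worker_id: str, used_mb: int, total_mb: int,
                    queued_jobs: int) -> str:
    return json.dumps({
        "worker_id": worker_id,        # static: who is reporting
        "ts": int(time.time()),        # when the sample was taken
        "memory_used_mb": used_mb,     # dynamic: runtime memory usage
        "memory_total_mb": total_mb,
        "queued_jobs": queued_jobs,    # dynamic: task queue snapshot
    })
```

The Master can then skip workers whose reported memory headroom is too small when dispatching the next job.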
Performance Improvements
Hera addresses SLA issues for crowd computing, data migration, and data product reliability. By co‑locating compute and storage, reducing HDFS hotspots, and leveraging Alluxio caching, crowd‑computing tasks see a 10‑30% speedup.
Alluxio Cache Table Synchronization
Hive table locations are switched from HDFS to Alluxio paths, creating cache tables that pull data from HDFS on demand. A periodic task detects new partitions and triggers a SYN2ALLUXIO job to add matching partitions to the Alluxio table, keeping data synchronized.
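The periodic sync task boils down to a set difference between the HDFS table's partitions and the Alluxio cache table's partitions, with a SYN2ALLUXIO job submitted for each missing one. The listing and submission callables below are assumptions standing in for the real metastore and scheduler calls.

```python
# Sketch of the periodic partition-sync cycle: detect partitions that
# exist on the HDFS-backed table but not yet on the Alluxio cache table,
# then submit a SYN2ALLUXIO job for each. Callables are assumptions.
from typing import Callable, Iterable, List


def partitions_to_sync(hdfs_partitions: Iterable[str],
                       alluxio_partitions: Iterable[str]) -> List[str]:
    """Return partitions the cache table is missing, in sorted order."""
    return sorted(set(hdfs_partitions) - set(alluxio_partitions))


def run_sync_cycle(list_hdfs: Callable[[], list],
                   list_alluxio: Callable[[], list],
                   submit_syn2alluxio: Callable[[str], None]) -> None:
    for part in partitions_to_sync(list_hdfs(), list_alluxio()):
        submit_syn2alluxio(part)  # triggers the SYN2ALLUXIO job
```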
Crowd Computing Tasks
When the underlying table is cached in Alluxio, Hera uses the Alluxio table for crowd calculations, achieving a 10‑30% performance gain thanks to data locality and isolated compute resources.
Conclusion
Hera now supports many production workloads, but challenges remain, such as handling inconsistent function signatures across engines (e.g., Presto vs. ClickHouse) and further optimizing crowd‑computing with industry‑standard ClickHouse Bitmap solutions.
Architects' Tech Alliance