Hera Data Service Architecture and Core Functions at Vipshop
The article introduces Vipshop's self‑developed Hera data service, detailing its background, three‑tier architecture, scheduling workflow, multi‑engine query capabilities, adaptive execution, SQL generation, resource isolation, metrics collection, and performance improvements such as Alluxio caching for large‑scale crowd computing.
Data services are a key component of a data middle platform: they act as a unified entry point for data warehouse access and provide standardized APIs for data inflow and outflow, meeting diverse data access needs.
Vipshop began building its own data service, Hera, in 2019. Since then, Hera has grown from scratch to serve more than 30 business units with both B2B and B2C data services.
Background
Before a unified data service existed, the data warehouse suffered from low efficiency, inconsistent metrics, and a duplicated interface for each engine (Presto, ClickHouse, etc.). These issues caused long extraction times for large advertising audiences, a proliferation of APIs, and inconsistent metric definitions.
The data service was created to address these problems by abstracting away storage and compute engines and offering a single API, layered storage, adaptive SQL generation, unified caching, and SLA‑driven performance.
Architecture Design
Hera follows a classic master/slave model with separate data and control flows for high availability. It consists of three layers:
Application Access Layer – provides TCP client, HTTP, and internal RPC (OSP) interfaces for business requests.
Data Service Layer – handles routing, multi‑engine support, resource configuration, dynamic engine‑parameter assembly, adaptive execution, a unified query cache, and FreeMarker‑based SQL generation.
Data Layer – transparently accesses data stored in the warehouse, ClickHouse, MySQL, Redis, etc., through a single API.
The scheduling system consists of Master, Worker, Client, ConfigCenter, and TransferServer modules, which together manage job distribution, ETL execution, and file transfer.
Main Functions
Multi‑queue scheduling: different users and task types are assigned weighted queues to meet SLA requirements.
Multi‑engine query: supports Spark, Presto, ClickHouse, Hive, MySQL, Redis, selecting the optimal engine per scenario.
Multiple task types: ETL, ad hoc query, file export, and data import, enabling combinations such as Spark‑adhoc and Presto‑adhoc.
File export: large‑scale data export to HDFS/Alluxio with TCP download, reducing export time from 30+ minutes to under 3 minutes for massive audience datasets.
Resource isolation: separates worker and engine resources for core vs. non‑core workloads.
Dynamic engine‑parameter assembly: automatically builds and adjusts engine parameters per task, user, and business type.
Adaptive engine execution: if the chosen engine fails, the system switches to another engine to ensure query success.
SQL construction: supports single‑table, star‑schema, and snowflake models for dimensional modeling.
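The adaptive engine execution described above can be sketched as a simple fallback loop: try the preferred engine first and move on to the next rewrite when it fails. This is a minimal illustration, not Hera's actual code; the engine names and the `run_query` callable are assumptions.

```python
# Hypothetical sketch of adaptive engine execution: try engines in
# preference order, falling back to the next one on any failure.
from typing import Callable, Dict, List


def execute_adaptively(
    sql_by_engine: Dict[str, str],      # engine name -> rewritten SQL
    engine_order: List[str],            # preference order, e.g. ["presto", "spark"]
    run_query: Callable[[str, str], list],
) -> list:
    """Run the query on the first engine that succeeds."""
    last_error = None
    for engine in engine_order:
        sql = sql_by_engine.get(engine)
        if sql is None:
            continue  # no rewrite exists for this engine
        try:
            return run_query(engine, sql)
        except Exception as exc:  # engine down, timeout, dialect mismatch...
            last_error = exc      # remember the cause, try the next engine
    raise RuntimeError(f"all engines failed: {last_error}")
```

In practice the preference order would itself come from the routing layer, based on the scenario (latency‑sensitive ad hoc queries vs. heavy ETL).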
Task Scheduling
Uses Netty for cluster messaging, separating network I/O from business logic via distinct thread pools. Multi‑queue and multi‑user scheduling consider queue weight, dynamic factors (queue size/capacity, running tasks), and job weight (based on timeout) to compute a final score for dispatch.
Score = job weight + queue dynamic factor + queue weight.
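The score formula above can be sketched as follows. The article only names the three components, so the concrete weighting below (timeout‑derived job weight, a dynamic factor from queue fill level and running tasks) is an illustrative assumption.

```python
# Illustrative dispatch score: score = job weight + queue dynamic factor
# + queue weight. The exact weighting functions are assumptions.

def dispatch_score(job_timeout_s: float, queue_size: int,
                   queue_capacity: int, running_tasks: int,
                   queue_weight: float) -> float:
    # Tighter timeout (stricter SLA) -> larger job weight.
    job_weight = 1.0 / max(job_timeout_s, 1.0)
    # Emptier queue with fewer running tasks -> larger dynamic factor.
    dynamic_factor = (1.0 - queue_size / queue_capacity) - 0.01 * running_tasks
    return job_weight + dynamic_factor + queue_weight
```

A job with a tighter SLA in an idle queue thus outranks a loose‑SLA job in a busy one, which is the dispatch behavior the formula is meant to produce.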
SQL Job Flow
Clients submit raw SQL (e.g., Presto). The SQLParser rewrites it for target engines (Spark, Presto, ClickHouse) and returns all possible rewrites. The Master schedules the job to Workers, which execute on the chosen engine; if execution fails, another engine is tried. Results are sent directly to the client, and the job lifecycle progresses through NEW → WAITING → RUNNING → SUCCESS/FAIL.
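The lifecycle NEW → WAITING → RUNNING → SUCCESS/FAIL is a small state machine; a toy version might look like the sketch below. The transition table is inferred from the arrow chain in the text, not taken from Hera's source.

```python
# Toy state machine for the job lifecycle NEW -> WAITING -> RUNNING
# -> SUCCESS/FAIL. Transition rules are assumptions from the article.

VALID_TRANSITIONS = {
    "NEW": {"WAITING"},
    "WAITING": {"RUNNING"},
    "RUNNING": {"SUCCESS", "FAIL"},
    "SUCCESS": set(),   # terminal
    "FAIL": set(),      # terminal
}


class Job:
    def __init__(self) -> None:
        self.state = "NEW"

    def advance(self, new_state: str) -> None:
        if new_state not in VALID_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```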
Metrics Collection
Hera collects static metrics (master/worker/client info) and dynamic metrics (runtime memory usage, task queue snapshots). Workers report heartbeats with memory usage, enabling the Master to make informed scheduling decisions.
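A worker heartbeat combining static identity with dynamic runtime metrics might carry a payload like the one below. The field names are illustrative assumptions, not Hera's actual wire format.

```python
# Illustrative heartbeat payload: static worker identity plus dynamic
# memory usage and a task-queue snapshot. Field names are assumptions.
import json
import time


def build_heartbeat(worker_id: str, used_mb: int, total_mb: int,
                    queued_jobs: int) -> str:
    return json.dumps({
        "worker_id": worker_id,        # static: who is reporting
        "ts": int(time.time()),        # when the sample was taken
        "memory_used_mb": used_mb,     # dynamic: runtime memory usage
        "memory_total_mb": total_mb,
        "queued_jobs": queued_jobs,    # dynamic: task queue snapshot
    })
```

The Master can then skip workers whose reported memory headroom is too small when dispatching the next job.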
Performance Improvements
Hera addresses SLA issues for crowd computing, data migration, and data product reliability. By co‑locating compute and storage, reducing HDFS hotspots, and leveraging Alluxio caching, crowd‑computing tasks see a 10‑30% speedup.
Alluxio Cache Table Synchronization
Hive table locations are switched from HDFS to Alluxio paths, creating cache tables that pull data from HDFS on demand. A periodic task detects new partitions and triggers a SYN2ALLUXIO job to add matching partitions to the Alluxio table, keeping data synchronized.
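The periodic sync task boils down to a set difference between the HDFS table's partitions and the Alluxio cache table's partitions, with a SYN2ALLUXIO job submitted for each missing one. The listing and submission callables below are assumptions standing in for the real metastore and scheduler calls.

```python
# Sketch of the periodic partition-sync cycle: detect partitions that
# exist on the HDFS-backed table but not yet on the Alluxio cache table,
# then submit a SYN2ALLUXIO job for each. Callables are assumptions.
from typing import Callable, Iterable, List


def partitions_to_sync(hdfs_partitions: Iterable[str],
                       alluxio_partitions: Iterable[str]) -> List[str]:
    """Return partitions the cache table is missing, in sorted order."""
    return sorted(set(hdfs_partitions) - set(alluxio_partitions))


def run_sync_cycle(list_hdfs: Callable[[], list],
                   list_alluxio: Callable[[], list],
                   submit_syn2alluxio: Callable[[str], None]) -> None:
    for part in partitions_to_sync(list_hdfs(), list_alluxio()):
        submit_syn2alluxio(part)  # triggers the SYN2ALLUXIO job
```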
Crowd Computing Tasks
When the underlying table is cached in Alluxio, Hera uses the Alluxio table for crowd calculations, achieving a 10‑30% performance gain thanks to data locality and isolated compute resources.
Conclusion
Hera now supports many production workloads, but challenges remain, such as handling inconsistent function signatures across engines (e.g., Presto vs. ClickHouse) and further optimizing crowd‑computing with industry‑standard ClickHouse Bitmap solutions.
Architects' Tech Alliance