
Design and Architecture of Hera Data Service for Unified Data Access at Vipshop

The article details the background, architecture, core features, scheduling mechanisms, Lisp‑based query DSL, and Alluxio integration of Vipshop's self‑developed Hera data service, illustrating how it unifies multi‑engine data access, improves SLA, and accelerates large‑scale crowd computing tasks.

Architecture Digest

A data service is a critical component of the data middle platform: it serves as a unified entry point through which applications access the data warehouse as if it were a single database, with consistent API control over data inflow and outflow to meet diverse user requirements.

Vipshop began building its own data service, Hera, in 2019; it has since grown from scratch to serve more than 30 business lines with both business-facing (ToB) and consumer-facing (ToC) data services.

Background

Before a unified data service, the warehouse suffered from low efficiency, fragmented interfaces, and inconsistent metric definitions, leading to high latency for large‑scale audience (USP/DMP) exports and difficulty maintaining numerous engine‑specific APIs.

Architecture Design

Hera follows a classic master‑slave model with separate data and control flows to ensure high availability. It consists of three layers:

Application Access Layer – supports TCP client, HTTP, and internal RPC (OSP) interfaces.

Data Service Layer – handles routing, multi‑engine support, engine resource configuration, dynamic parameter assembly, SQLLisp engine generation, adaptive execution, unified query cache, and FreeMarker‑based SQL generation.

Data Layer – abstracts underlying storage (data warehouse, ClickHouse, MySQL, Redis) behind a single API.

Core Functions

Multi‑queue scheduling with weight‑based SLA guarantees.

Multi‑engine query support (Spark, Presto, ClickHouse, Hive, MySQL, Redis).

Support for various task types: ETL, ad hoc query, file export, and data import.

File export pipeline that writes results to HDFS/Alluxio and allows TCP download, reducing crowd export time from >30 min to ≤3 min.

Resource isolation for core vs. non‑core workloads.

Dynamic engine‑parameter assembly and adaptive engine execution that falls back to another engine on failure.

SQL construction based on dimensional modeling (single‑table, star, snowflake).

Lisp‑Based Metric DSL

Hera defines a custom Lisp syntax for expressing metric calculations, abstracting engine‑specific details. Example expressions include aggregation, conditional, type‑casting, and generic function calls. The DSL is parsed with ANTLR4, traversed by a listener, and translated into engine‑specific SQL.
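The article does not reproduce Hera's actual grammar, so the following is a toy sketch of the idea only: a Lisp-style metric expression is parsed into a tree and walked to emit SQL. The expression syntax, operator set, and helper names here are illustrative assumptions; Hera's real implementation parses with ANTLR4 and translates per target engine.

```python
# Toy translator for a hypothetical Lisp-style metric DSL. Hera parses with
# an ANTLR4-generated listener; this sketch hand-rolls the parse for brevity.

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expr
    return tok

def to_sql(expr):
    if isinstance(expr, str):          # leaf: column name or literal
        return expr
    op, *args = expr
    if op in ("sum", "count", "avg", "max", "min"):   # aggregation
        return f"{op.upper()}({to_sql(args[0])})"
    if op == "if":                                     # (if cond then else)
        cond, then, els = (to_sql(a) for a in args)
        return f"CASE WHEN {cond} THEN {then} ELSE {els} END"
    if op in (">", "<", ">=", "<=", "="):              # comparison
        return f"{to_sql(args[0])} {op} {to_sql(args[1])}"
    if op == "cast":                                   # (cast expr type)
        return f"CAST({to_sql(args[0])} AS {args[1].upper()})"
    # generic function call
    return f"{op.upper()}({', '.join(to_sql(a) for a in args)})"

metric = "(sum (if (> amount 100) amount 0))"
print(to_sql(parse(tokenize(metric))))
# SUM(CASE WHEN amount > 100 THEN amount ELSE 0 END)
```

Because the DSL tree, not the SQL string, is the canonical form, the same metric definition can be lowered to different dialects by swapping the emitter.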

Task Scheduling

Built on Netty, the scheduler uses a single EventLoopGroup for network I/O and a separate thread pool for business logic. It implements multi‑queue, multi‑user scheduling with weight‑based scoring:

queue_dynamic_factor = queue_size / queue_capacity * (1 - running_jobs / queue_parallelism)
job_weight = 1 - (current_time - enqueue_time) / timeout
score = job_weight + queue_dynamic_factor + queue_weight

Jobs are ordered first by queue weight, then by dynamic factor, and finally by individual job weight.
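Taken literally, the formulas above can be sketched as plain functions (the normalization of each input is an assumption; the article gives only the formulas, not their units or ranges):

```python
import time

def queue_dynamic_factor(queue_size, queue_capacity, running_jobs, queue_parallelism):
    # queue_size / queue_capacity * (1 - running_jobs / queue_parallelism)
    return queue_size / queue_capacity * (1 - running_jobs / queue_parallelism)

def job_weight(enqueue_time, timeout, now=None):
    # 1 - (current_time - enqueue_time) / timeout
    now = time.time() if now is None else now
    return 1 - (now - enqueue_time) / timeout

def score(jw, qdf, queue_weight):
    # score = job_weight + queue_dynamic_factor + queue_weight
    return jw + qdf + queue_weight
```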

SQL Job Flow

Clients submit raw SQL (e.g., in the Presto dialect). The SQLParser rewrites it for the target engine (Spark, Presto, or ClickHouse). The Master schedules the job onto Workers; if the chosen engine fails, the job is retried on another engine. Results are streamed back to the client through the Worker.
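The retry-on-another-engine behavior can be sketched as a simple ordered fallback loop. The engine names and the `run_on_engine` callable are illustrative assumptions, not Hera's actual API:

```python
# Adaptive-execution sketch: try the preferred engine first, fall back to the
# next engine on failure, and surface all errors if every engine fails.

def execute_with_fallback(sql, engines, run_on_engine):
    errors = {}
    for engine in engines:
        try:
            return run_on_engine(engine, sql)
        except Exception as exc:   # a real system would retry only retryable errors
            errors[engine] = exc
    raise RuntimeError(f"all engines failed: {errors}")
```

In practice the fallback order would come from the engine resource configuration in the data service layer, and only transient failures would trigger a retry.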

Metrics Collection

Both static (node info) and dynamic (runtime) metrics are gathered. Workers report heartbeat data, including memory usage, enabling the Master to make informed scheduling decisions.
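As an illustration of how heartbeat data could feed placement decisions, the sketch below picks the least-loaded worker. The record fields and the selection policy are assumptions, not Hera's actual wire format or algorithm:

```python
from dataclasses import dataclass

@dataclass
class Heartbeat:
    worker_id: str
    free_memory_mb: int
    running_jobs: int

def pick_worker(heartbeats):
    # Prefer the worker with the most free memory; break ties on fewest jobs.
    return max(heartbeats, key=lambda h: (h.free_memory_mb, -h.running_jobs)).worker_id
```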

Alluxio Cache Synchronization

Hive tables can be mirrored to Alluxio by replacing the HDFS location with an Alluxio path. A periodic task detects new partitions in the source table and triggers a SYN2ALLUXIO job to keep the Alluxio cache up‑to‑date.

-- Source table, backed by HDFS
CREATE TABLE `hdfs.ads_tags_table`(
  `oaid_md5` string,
  `mid` string,
  `user_id` bigint,
  ...
) PARTITIONED BY (`dt` string) LOCATION 'hdfs://xxxx/hdfs.db/ads_tags_table';

-- Mirror table with the same schema and the LOCATION swapped to Alluxio
CREATE TABLE `alluxio.ads_tags_table`(
  `oaid_md5` string,
  `mid` string,
  `user_id` bigint,
  ...
) PARTITIONED BY (`dt` string) LOCATION 'alluxio://zk@IP1:2181,IP2:2181/alluxio.db/ads_tags_table';
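The periodic sync task boils down to a partition diff between the two tables. The sketch below assumes hypothetical `list_partitions` and `submit_sync_job` helpers; the article only states that a SYN2ALLUXIO job is triggered for new partitions:

```python
# Partition-sync sketch: find dt partitions present in the HDFS source but
# missing from the Alluxio mirror, and submit one sync job per missing dt.

def sync_new_partitions(list_partitions, submit_sync_job,
                        source="hdfs.ads_tags_table",
                        mirror="alluxio.ads_tags_table"):
    missing = set(list_partitions(source)) - set(list_partitions(mirror))
    for dt in sorted(missing):
        submit_sync_job(source, mirror, dt)   # e.g. a SYN2ALLUXIO task for dt
    return sorted(missing)
```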

Crowd Computing Example

A Spark ETL job that reads from the HDFS table can be automatically rewritten to use the Alluxio table, achieving a 10‑30% speedup.

-- Original ETL job reading the HDFS-backed table
INSERT INTO hive_advert.cd0000127760_full
SELECT result_id, '20210703'
FROM (
  SELECT oaid_md5 AS result_id
  FROM hdfs.ads_tags_table AS ta
  WHERE ta.dt = '20210702' AND ...
) AS t;

-- Rewritten automatically to read the Alluxio mirror instead
INSERT INTO hive_advert.cd0000127760_full
SELECT result_id, '20210703'
FROM (
  SELECT oaid_md5 AS result_id
  FROM alluxio.ads_tags_table AS ta
  WHERE ta.dt = '20210702' AND ...
) AS t;
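At its simplest, the rewrite is a table-name substitution. The sketch below does it textually for illustration; a production rewriter would operate on the parsed SQL AST rather than on raw strings, and the mirror mapping here is an assumed example:

```python
import re

# Illustrative hdfs-to-Alluxio table mapping (assumption, not Hera's config).
MIRRORS = {"hdfs.ads_tags_table": "alluxio.ads_tags_table"}

def rewrite_to_alluxio(sql, mirrors=MIRRORS):
    # Replace each HDFS-backed table reference with its Alluxio mirror.
    for hdfs_table, alluxio_table in mirrors.items():
        sql = re.sub(rf"\b{re.escape(hdfs_table)}\b", alluxio_table, sql)
    return sql
```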

Conclusion

Hera now supports many production workloads but still faces challenges such as inconsistent function signatures across engines and further optimization of multi‑engine compatibility.

Tags: big data, scheduling, SQL engine, distributed computing, Alluxio, data service, Hera
Written by Architecture Digest, a publication focusing on Java backend development and covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
