
Design and Implementation of Xianyu Real-Time Data Warehouse

To meet Xianyu’s need for real‑time analysis over billions of events per day, the team built a petabyte‑scale warehouse using Hologres for storage and Blink (an Alibaba‑enhanced Flink) for streaming. Data is organized into ODS, DWD, DWS, and ADS layers plus a DIM layer, enabling minute‑level aggregations, rapid anomaly detection, and instant insights for product teams.

Xianyu Technology

Xianyu, a leading second‑hand trading app, generates nearly a hundred billion exposure and click events per day. Data at this scale creates urgent real‑time analysis requirements, such as quickly locating abnormal product exposure, providing instant reports for product circles, and delivering custom alerts.

The team surveyed various data‑warehouse designs and grouped them into four categories: (1) building from scratch, (2) extending existing warehouses, (3) simplifying full‑featured warehouses, and (4) tool‑oriented solutions. Modern stream processing frameworks (e.g., Apache Storm, Flink) and analytical engines (e.g., Hologres) were identified as key innovations.

After evaluating the options, the team settled on Hologres + Blink (an Alibaba‑enhanced Flink) to build a fully real‑time warehouse.

The data model follows a four‑layer design: ODS (Operational Data Store) for raw source data, DWD (Data Warehouse Detail) for cleaned and standardized records, DWS (Data Warehouse Service) for aggregated service data, and ADS (Application Data Store) for application‑specific queries. A DIM layer provides real‑time dimension tables for entities such as products, users, and scenes.
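The layering described above can be sketched in a few lines of Python. This is only an illustration of how a record might move from ODS through DWD and DWS to ADS; the field names (item_id, event, scene), the sample events, and the function names are assumptions made for the example, not Xianyu's actual schema or jobs.

```python
from collections import Counter

# Hypothetical raw ODS events; field names are illustrative only.
ODS_EVENTS = [
    {"item_id": " 1001 ", "event": "EXPOSURE", "scene": "search"},
    {"item_id": "1001", "event": "click", "scene": "search"},
    {"item_id": "1002", "event": "exposure", "scene": "feed"},
    {"item_id": None, "event": "exposure", "scene": "feed"},  # malformed record
]


def to_dwd(ods_events):
    """DWD: drop malformed records and standardize field values."""
    cleaned = []
    for raw in ods_events:
        if not raw.get("item_id"):
            continue  # skip records without an item id
        cleaned.append({
            "item_id": raw["item_id"].strip(),
            "event": raw["event"].lower(),
            "scene": raw["scene"],
        })
    return cleaned


def to_dws(dwd_events):
    """DWS: aggregate detail records into per-(scene, event) counts."""
    return Counter((e["scene"], e["event"]) for e in dwd_events)


def to_ads(dws_counts, scene):
    """ADS: shape the aggregate for one application query (click-through rate)."""
    exposures = dws_counts.get((scene, "exposure"), 0)
    clicks = dws_counts.get((scene, "click"), 0)
    ctr = clicks / exposures if exposures else 0.0
    return {"scene": scene, "exposures": exposures, "clicks": clicks, "ctr": ctr}
```

Calling to_ads(to_dws(to_dwd(ODS_EVENTS)), "search") walks one scene through all four layers. In a real deployment each step would be a streaming job writing to its own table rather than an in‑memory function.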

The overall system consists of five layers from bottom to top: data source, ingestion layer, compute layer, service layer, and application layer. Key challenges include handling petabyte‑scale event logs, meeting strict latency requirements for monitoring and alerts, supporting complex interactive analytics, and integrating heterogeneous data sources.

Streaming is powered by Blink. Minute‑level aggregations are expressed with tumbling windows, e.g., GROUP BY TUMBLE(<time-attr>, <size-interval>). Event time is preferred over processing time so that re‑runs over the same data produce consistent results.
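To make the tumbling‑window idea concrete, here is a minimal Python sketch that assigns each event to a fixed one‑minute window by its event timestamp. It is not Blink code: the (timestamp, key) event shape and the function names are assumptions for illustration, and a real streaming job would additionally need watermarks to decide when a window can be closed.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # one-minute tumbling window


def window_start(event_ts, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains event_ts."""
    return event_ts - (event_ts % size)


def tumble_count(events, size=WINDOW_SECONDS):
    """Count events per (window_start, key) using event time only.

    `events` is an iterable of (event_ts_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for event_ts, key in events:
        counts[(window_start(event_ts, size), key)] += 1
    return dict(counts)
```

Because the window is derived from the timestamp carried by the event rather than from arrival time, feeding the same events in a different order yields the same counts, which is the property that makes re‑runs reproducible.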

Hologres serves as the analytical engine, offering PB‑scale storage with low‑latency, high‑concurrency queries. Its shard‑tablet architecture uses write‑ahead logs and background compaction to maintain read performance.
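The interplay of write‑ahead logging and background compaction can be illustrated with a toy log‑structured store. This sketch shows the general technique only and makes no claim about Hologres internals; the ToyShard class, its flush threshold, and its method names are all hypothetical.

```python
class ToyShard:
    """Toy log-structured store: WAL + in-memory table + immutable files."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # append-only log of (key, value) writes
        self.memtable = {}   # newest writes, served from memory
        self.files = []      # flushed immutable files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # log first, so the write can be replayed
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        """Freeze the memtable into an immutable file and reset the log."""
        self.files.append(dict(self.memtable))
        self.memtable = {}
        self.wal = []  # flushed writes no longer need replay

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for f in reversed(self.files):  # newest file wins
            if key in f:
                return f[key]
        return None

    def compact(self):
        """Merge all files into one so reads touch fewer files."""
        merged = {}
        for f in self.files:  # oldest first, so later writes overwrite
            merged.update(f)
        self.files = [merged] if merged else []
```

Without compaction, a read may have to check every flushed file; merging them in the background keeps the number of files per lookup small, which is why compaction matters for read performance.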

Heterogeneous sources are unified through domain‑level dimension statistics, allowing Blink to cleanse and enrich data with context such as user segments, buckets, and scenarios.
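As a rough sketch of this unify‑then‑enrich step, the Python below maps two differently shaped log records onto one common schema and then attaches user attributes from a dimension lookup. The two source formats, the field names, and the USER_DIM table are invented for illustration and do not reflect Xianyu's actual logs or DIM tables.

```python
# Hypothetical dimension table keyed by user id (stand-in for a DIM table).
USER_DIM = {
    "u1": {"segment": "new_user", "bucket": "A"},
    "u2": {"segment": "active_seller", "bucket": "B"},
}


def normalize_app_log(record):
    """Map a hypothetical client log record to the common schema."""
    return {
        "user_id": record["uid"],
        "item_id": record["itemId"],
        "scene": record["page"],
    }


def normalize_server_log(record):
    """Map a hypothetical server log record to the common schema."""
    return {
        "user_id": record["user"],
        "item_id": record["item"],
        "scene": record["scenario"],
    }


def enrich(event, user_dim=USER_DIM):
    """Attach dimension attributes; unknown users get explicit defaults."""
    dim = user_dim.get(event["user_id"], {})
    return {
        **event,
        "segment": dim.get("segment", "unknown"),
        "bucket": dim.get("bucket", "unknown"),
    }
```

Normalizing before enriching means the dimension lookup is written once against the common schema, regardless of how many upstream formats exist.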

Initial results show real‑time reporting for dashboards, rapid detection of exposure anomalies, and immediate feedback for product teams. Future work includes tighter integration with monitoring platforms, exposing the warehouse as an open service, and further performance optimizations.

Tags: big data, stream processing, Hologres, Blink, real-time data warehouse