
Understanding Youzan's Data Middle Platform: Architecture, Challenges, and Construction

He Fei explains how Youzan built a two-layer data middle platform: a technology layer of offline, online, and streaming components, and an asset layer for cataloguing, quality, lineage, and unified APIs. Together they address diverse business demands and technical complexity, and they enable cost-optimized, reusable real-time data services.

Youzan Coder

This article, authored by He Fei of Youzan's big data team, introduces the background, challenges, and construction approach of Youzan's data middle platform.

Overview

The term "middle platform" has no universal definition; the author adopts ThoughtWorks' definition of an "enterprise-level capability reuse platform." Middle platforms come in several kinds (business, search, data, etc.). The data middle platform processes and reuses Youzan's data assets, and it has two key functions: data processing (handled by the data technology middle platform) and data reuse (handled by the data asset middle platform).

Challenges Faced by the Data Team

The data team encounters both business and technical challenges.

Business challenges

Vertical business lines are numerous (e.g., Youzan Mini‑Mall, Retail, Beauty, Education).

Multiple business domains such as products, stores, members, transactions, payments.

Diverse data needs: backend reports, operational analytics, promotion dashboards, real‑time reports.

Rapid iteration of business requirements and high compatibility demands of SaaS.

Technical challenges

Proliferation of components leading to high maintenance cost.

High development threshold for engineers unfamiliar with the ecosystem.

Typical real‑time development questions include data source integration, sink selection, high‑availability deployment, consistency semantics, resource scaling, and integration with non‑big‑data components.

Data Middle Platform Structure

The platform consists of two main parts:

Data Technology Middle Platform

Data Asset Middle Platform

Data Technology Middle Platform

To reduce development cost, it provides a suite of tool‑oriented platforms:

Basic component operation and management

Data development platform

Data asset management platform

Data metric management

Unified data services

Key big‑data components include offline components (HDFS, YARN, Hive, Spark), distributed online storage (HBase, Kafka, Druid), and real‑time engines (Storm, Spark Streaming, Flink). Each component class has distinct operational requirements (e.g., latency for real‑time, throughput for batch).

Effective operation involves:

Defining core metrics for each subsystem (e.g., HDFS TPS, latency, block loss).

Monitoring those metrics.

Setting alerts based on safety thresholds.

Developing custom patches for security or feature gaps.

Standardizing software and configuration release processes.

Running regular fault-injection drills.

Benchmarking performance.
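
The first three steps above (define core metrics, monitor them, alert on safety thresholds) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names, the limits, and the `evaluate` helper are hypothetical and are not taken from Youzan's tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str                               # hypothetical metric name
    limit: float                              # safety threshold for that metric
    breached: Callable[[float, float], bool]  # (value, limit) -> True if unsafe


def above(value: float, limit: float) -> bool:
    return value > limit


def below(value: float, limit: float) -> bool:
    return value < limit


def evaluate(samples: Dict[str, float], rules: List[Threshold]) -> List[str]:
    """Return one alert message per rule that is breached or has no data."""
    alerts = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is None:
            # A missing sample is flagged too: the collector itself may be down.
            alerts.append(f"{rule.metric}: no data")
        elif rule.breached(value, rule.limit):
            alerts.append(f"{rule.metric}: {value} breaches limit {rule.limit}")
    return alerts


# Assumed example rules for an HDFS subsystem (values are illustrative only).
RULES = [
    Threshold("hdfs.rpc_latency_ms", 200.0, above),
    Threshold("hdfs.missing_blocks", 0.0, above),
    Threshold("hdfs.ops_per_sec", 100.0, below),
]
```

Calling `evaluate({"hdfs.rpc_latency_ms": 350.0, "hdfs.missing_blocks": 0.0}, RULES)` yields an alert for the latency breach and a "no data" alert for the throughput metric that was never reported.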

Data Development Platform

The data development platform focuses on data processing and offers two sub-platforms:

Offline development platform for batch ETL, scheduling, monitoring, etc.

Real‑time computation platform for streaming jobs, monitoring, and alerts.

Data Asset Management Platform

The data asset management platform provides a unified view of data resources across components (Hive tables, HBase tables, Druid datasources, Kafka topics). Core functions include:

Data catalog (data map) for discovery and reuse.

Data quality checks based on predefined rules.

Cost accounting for component usage.

Data lineage management for impact analysis and lifecycle control.
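
Impact analysis over lineage amounts to a graph traversal: given one asset, find everything derived from it. The sketch below assumes lineage is stored as a simple mapping from an upstream asset to its direct downstream assets; the asset names and the `downstream` function are hypothetical, and a production system would read this graph from its metadata store.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: upstream asset -> assets derived directly from it.
LINEAGE: Dict[str, List[str]] = {
    "hive:ods.orders": ["hive:dw.order_fact"],
    "hive:dw.order_fact": ["hive:dm.retail_sales", "druid:order_realtime"],
    "hive:dm.retail_sales": ["hbase:retail_report"],
}


def downstream(asset: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk returning every asset affected by a change to `asset`."""
    seen: Set[str] = {asset}
    affected: List[str] = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against revisiting shared or cyclic nodes
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected
```

An asset with an empty downstream list is a candidate for lifecycle review, since nothing registered in the graph depends on it.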

Data Metric Management

Data metric management covers atomic and derived metrics and keeps their definitions consistent across the organization. Atomic metrics are defined in the DW layer of the data warehouse, while derived metrics are built by business teams on top of them.
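
One way to picture the atomic/derived split is a registry in which a derived metric can only be declared against an existing atomic metric, so every business-level number traces back to a single shared definition. The classes, table name, and SQL shape below are assumptions made for illustration, not Youzan's actual model.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AtomicMetric:
    name: str
    table: str        # DW-layer table (hypothetical name)
    expression: str   # aggregation over that table


@dataclass(frozen=True)
class DerivedMetric:
    name: str
    base: str                       # name of the atomic metric it builds on
    filters: Tuple[str, ...] = ()   # business-specific restrictions


class MetricRegistry:
    def __init__(self) -> None:
        self._atomic: Dict[str, AtomicMetric] = {}
        self._derived: Dict[str, DerivedMetric] = {}

    def add_atomic(self, metric: AtomicMetric) -> None:
        # One definition per name keeps the metric consistent for all teams.
        if metric.name in self._atomic:
            raise ValueError(f"atomic metric already defined: {metric.name}")
        self._atomic[metric.name] = metric

    def add_derived(self, metric: DerivedMetric) -> None:
        # Derived metrics must reference a registered atomic metric.
        if metric.base not in self._atomic:
            raise KeyError(f"unknown atomic metric: {metric.base}")
        self._derived[metric.name] = metric

    def to_sql(self, name: str) -> str:
        """Expand a derived metric into a query over its atomic definition."""
        derived = self._derived[name]
        atomic = self._atomic[derived.base]
        sql = f"SELECT {atomic.expression} AS {derived.name} FROM {atomic.table}"
        if derived.filters:
            sql += " WHERE " + " AND ".join(derived.filters)
        return sql
```

Registering `pay_amount` once as an atomic metric and then deriving `retail_pay_amount` with a business-line filter means the aggregation logic lives in exactly one place.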

Unified Data Service

After data is processed, it is exported to online storage and exposed via configurable API templates, reducing duplicated development for downstream services. The service is newly launched and already supports more than ten business scenarios.
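
A configurable API template can be thought of as a route plus a parameterized query, so that a new endpoint is a piece of configuration rather than new service code. The following is a minimal sketch under that assumption; the `ApiTemplate` class, the `:name` placeholder syntax, and the table and route names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

# Named placeholders look like ":shop_id" in this sketch.
_PLACEHOLDER = re.compile(r":([a-z_]+)")


@dataclass(frozen=True)
class ApiTemplate:
    path: str    # route under which the query is exposed
    query: str   # query text with named placeholders


def render(template: ApiTemplate, params: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Convert named placeholders to positional "?" marks plus ordered arguments."""
    names = _PLACEHOLDER.findall(template.query)
    missing = [name for name in names if name not in params]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    args: List[Any] = []

    def bind(match: "re.Match[str]") -> str:
        # Values are passed separately from the query text, never interpolated.
        args.append(params[match.group(1)])
        return "?"

    return _PLACEHOLDER.sub(bind, template.query), args


# Assumed example template (names are illustrative only).
DAILY_SALES = ApiTemplate(
    path="/shop/daily_sales",
    query="SELECT dt, pay_amount FROM shop_daily_sales "
          "WHERE shop_id = :shop_id AND dt >= :start_dt",
)
```

Keeping the values out of the query string lets the storage driver bind them safely, and it lets each downstream team add an endpoint by supplying only a path and a query.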

Data Asset Middle Platform

Beyond technical infrastructure, the asset side emphasizes data availability for business users. Assets are categorized as offline data assets (data warehouse), real‑time data assets, and data intelligence services. The offline warehouse is organized into three layers: public data layer (ODS/DW), vertical business domain layer (DM), and data service layer (export to online storage or unified service).

Conclusion & Outlook

The Youzan data middle platform has evolved in response to a steady stream of business and technical challenges. Future work will focus on cost optimization, data asset management and reuse, and real-time warehousing.

Further Reading

Real‑time Computing Practices at Youzan – Efficiency Improvements

Youzan Data Warehouse Metadata System Practice

How We Redesigned NSQ – Features and Future Plans
