
Youzan Data Warehouse Metadata System: From Manual Tables to Metadata‑Driven Architecture

Youzan’s data‑warehouse metadata system evolved from manually maintained tables, to an automated data dictionary, and finally to a metadata‑driven architecture. The current system automatically captures technical, business, and process metadata, visualizes lineage, tracks resource usage, and manages synchronization rules and permissions; the next step is to improve novice usability with visual models and impact‑analysis tools.

Youzan Coder

Background

The business systems at Youzan originally used MySQL, while the data warehouse was built on Hive. Rapid business growth and increasing staff created a need for better metadata management.

Version 1: Manually Maintained Tables

In the early stage, the data volume was small and developers could handle ETL and analysis manually, so no metadata system was built. As dedicated data analysts joined, they needed a way to view core table information, which was not reliably documented in the source databases. A simple web‑based data dictionary was created to record field meanings, statistical definitions, and SQL statements.

Version 2: Automated Data Dictionary

Manual maintenance quickly became unsustainable as new business lines and tables were added. An automated approach was adopted:

For MySQL, periodically query the DDL management system for recently changed tables and their schemas.

For Hive, scan the Metastore every 30 seconds to detect recent DDL changes and synchronize structures via Hive JDBC.

Periodically perform full sync of all tables from MySQL and Hive.
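The incremental-plus-full-sync approach above boils down to diffing schema snapshots between scans. A minimal sketch of that diff step (the snapshot shape and function name are illustrative, not Youzan's actual code):

```python
from typing import Dict, List, Tuple

# table name -> ordered list of (column, type) pairs
Schema = Dict[str, List[Tuple[str, str]]]

def diff_schemas(previous: Schema, current: Schema) -> Dict[str, List[str]]:
    """Compare two schema snapshots and report DDL-level changes."""
    added = sorted(set(current) - set(previous))
    dropped = sorted(set(previous) - set(current))
    altered = sorted(
        t for t in set(previous) & set(current) if previous[t] != current[t]
    )
    return {"added": added, "dropped": dropped, "altered": altered}
```

A periodic job would fetch the current snapshot (from the DDL management system for MySQL, or via the Metastore for Hive), diff it against the stored one, and persist the changes.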

A DDL comment standard was introduced for MySQL tables and fields, reducing future maintenance effort. DDL change alerts were also added to detect issues early.
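A comment standard is only useful if it is enforced. One way to back it with an alert, sketched here against rows shaped like MySQL's `information_schema.COLUMNS` (the helper name is hypothetical):

```python
from typing import Iterable, List, Optional, Tuple

def find_missing_comments(
    columns: Iterable[Tuple[str, str, Optional[str]]]
) -> List[Tuple[str, str]]:
    """Given (table, column, comment) rows, e.g. selected from MySQL's
    information_schema.COLUMNS, return the (table, column) pairs that
    violate the comment standard."""
    return [(t, c) for t, c, comment in columns if not (comment or "").strip()]
```

Running this after each detected DDL change turns the comment standard into an automatic check rather than a convention.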

Version 3: Metadata‑Driven Data Warehouse

With further business expansion, more users required visibility into table status and metadata. Following Kimball’s data‑warehouse theory, three types of metadata were collected or computed automatically:

Technical metadata (table structures, file paths, formats).

Business metadata (owners, business domain, lineage).

Process metadata (daily row counts, sizes, update timestamps).
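The three categories above can be modeled as one record per table. A sketch of such a model (field names are illustrative, not Youzan's schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TableMetadata:
    # Technical metadata
    name: str
    hdfs_path: str
    file_format: str            # e.g. "ORC"
    # Business metadata
    owner: str
    domain: str
    upstream: List[str] = field(default_factory=list)
    # Process metadata (filled in after each run)
    row_count: Optional[int] = None
    size_bytes: Optional[int] = None
    last_updated: Optional[str] = None
```

Technical and business fields are set when a table is registered; process fields are refreshed after every offline task.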

Lineage

All tables—Hive, MySQL, HBase, BI reports—are treated as nodes in a DAG representing data lineage. Relationships are captured from SQL, ETL jobs (based on DataX), and MapReduce/Spark scripts. The lineage is visualized using the open‑source jsmind library, enabling users to see upstream and downstream tables.
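Capturing lineage from SQL means extracting, for each statement, the target table and its source tables as DAG edges. A deliberately naive sketch using regexes (a production system would parse the SQL into an AST; this only handles simple `INSERT ... SELECT` statements):

```python
import re
from typing import List, Tuple

def extract_lineage(sql: str) -> List[Tuple[str, str]]:
    """Return (source, target) edges for a simple INSERT ... SELECT.
    Target comes from the INSERT clause; sources from FROM/JOIN clauses."""
    target = re.search(r"insert\s+(?:overwrite\s+table|into)\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    if not target:
        return []
    return [(src, target.group(1)) for src in sorted(set(sources))]
```

Edges from every captured job accumulate into the global DAG that the jsmind front end then renders.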

Hive Table Row Count & Update Time

After each offline task, the system captures Hive table row counts and update times. File size and modification time are obtained directly from HDFS, while row counts are calculated with Presto, which efficiently reads only ORC index data. All Hive tables were switched to ORC format to enable fast counting.
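The two collection paths can be sketched as follows: building the Presto count query, and parsing the output of `hdfs dfs -stat '%b %Y' <path>` (size in bytes, modification time in epoch milliseconds). Function names are illustrative:

```python
from typing import Optional, Tuple

def count_query(table: str, partition_filter: Optional[str] = None) -> str:
    """Build the Presto count query. Over ORC, count(*) reads only
    index/footer metadata, so even large tables count quickly."""
    sql = f"SELECT count(*) FROM {table}"
    if partition_filter:
        sql += f" WHERE {partition_filter}"
    return sql

def parse_hdfs_stat(line: str) -> Tuple[int, int]:
    """Parse one line of `hdfs dfs -stat '%b %Y' <path>` output into
    (size_bytes, mtime_epoch_ms)."""
    size, mtime_ms = line.split()
    return int(size), int(mtime_ms)
```

Both results are written back into the table's process metadata after each offline task completes.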

Compute Resource Management

Resource usage of completed offline jobs is collected from YARN. For MapReduce jobs, CPU, memory, and disk I/O are recorded; for Spark jobs, CPU usage is captured. These metrics are plotted over time to spot anomalies, estimate hardware costs, and identify resource‑intensive tables or business processes.
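Spotting anomalies in the plotted series amounts to flagging jobs whose latest usage departs sharply from their own baseline. A minimal sketch, assuming daily CPU-second totals per job and a simple trailing-average rule (the threshold and rule are illustrative):

```python
from typing import Dict, List

def flag_anomalies(history: Dict[str, List[float]], threshold: float = 2.0) -> List[str]:
    """history maps job name -> daily cpu-seconds. Flag a job when its
    latest day exceeds `threshold` times the average of prior days."""
    flagged = []
    for job, series in history.items():
        if len(series) < 2:
            continue  # need at least one prior day as baseline
        baseline = sum(series[:-1]) / len(series[:-1])
        if baseline > 0 and series[-1] > threshold * baseline:
            flagged.append(job)
    return sorted(flagged)
```

The same per-job series also feed cost estimation: summing CPU-seconds by business domain shows which processes dominate the cluster.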

Data Synchronization Rule Description

Metadata drives real‑time synchronization configurations. On the MySQL side, field‑table‑database mappings and migration histories are stored to generate sync rules automatically. On the Hive side, partitioning, file format, and HDFS path information are recorded for read/write configurations.
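Generating a sync rule then becomes a pure function of the stored metadata on both sides. A sketch producing a DataX-style reader/writer config (all field names here are illustrative, not Youzan's actual schema):

```python
from typing import Any, Dict

def build_sync_rule(mysql_meta: Dict[str, Any], hive_meta: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble a sync rule from MySQL-side mappings and Hive-side
    storage info held in the metadata system."""
    return {
        "reader": {
            "jdbc_url": f"jdbc:mysql://{mysql_meta['host']}/{mysql_meta['db']}",
            "table": mysql_meta["table"],
            "columns": mysql_meta["columns"],
        },
        "writer": {
            "path": hive_meta["hdfs_path"],
            "file_format": hive_meta["file_format"],
            "partition": hive_meta.get("partition"),
        },
    }
```

Because the rule is derived rather than hand-written, a database migration recorded in the metadata automatically updates every affected sync job.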

Business Domain Management & Permission

Business domain information, originally kept in a simple table, was expanded to map tables to business owners and domains. Using Apache Ranger‑style name‑matching rules, domain assignment and permission management are automated without storing permissions in the metadata system itself. Sensitive fields are identified via periodic sampling and masked in Ranger.
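Both mechanisms can be sketched briefly: wildcard name-matching for domain assignment, and pattern-based sampling for sensitive-field detection. The rule list, PII pattern, and ratio below are illustrative examples, not Youzan's actual configuration:

```python
import fnmatch
import re
from typing import List, Sequence, Tuple

# Illustrative rules: table-name pattern -> business domain
DOMAIN_RULES: List[Tuple[str, str]] = [("dw.trade_*", "trade"), ("dw.shop_*", "shop")]
# Example PII shape: a mainland-China mobile number (1 followed by 10 digits)
PHONE_RE = re.compile(r"^1\d{10}$")

def assign_domain(table: str) -> str:
    """First matching wildcard rule wins, Ranger-style."""
    for pattern, domain in DOMAIN_RULES:
        if fnmatch.fnmatch(table, pattern):
            return domain
    return "unassigned"

def looks_sensitive(samples: Sequence[str], min_ratio: float = 0.8) -> bool:
    """Mark a column sensitive if most sampled values match the PII pattern."""
    hits = sum(1 for v in samples if PHONE_RE.match(str(v)))
    return bool(samples) and hits / len(samples) >= min_ratio
```

Columns flagged by the sampler are then masked via Ranger policies, so the metadata system never has to store permissions itself.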

Next Steps

User interviews revealed that newcomers find the metadata system hard to use, while experienced users do not. Future work will focus on visualizing data models and metrics as entry points for novices, and building an impact‑analysis tool that links MySQL changes to downstream data‑warehouse effects.

Recruitment note: the big‑data team is hiring (referral email: [email protected]).

Tags: data warehouse, Hive, metadata management, resource monitoring, lineage
Written by Youzan Coder, the official Youzan tech channel, delivering technical insights from the Youzan tech team.