Data Warehouse Fundamentals, Modeling Techniques, and the Evolution of Maoyan’s Warehouse
This article explains the origins and challenges of scattered enterprise data, defines the data warehouse concept, details its four core characteristics, compares entity, normalization, and dimensional modeling methods, and illustrates Maoyan’s three‑stage data‑warehouse evolution with practical examples and diagrams.
In the early days of data warehousing, enterprises like Maoyan stored daily reports and data interfaces in disparate departmental systems, creating a complex web of data sources that were difficult to manage across business units.
Data sources are diverse and decision‑making data is fragmented.
Lack of standards makes integration hard.
Inconsistent data definitions reduce credibility.
Absence of data governance leads to poor data quality.
Extracting data from such raw storage eventually results in credibility loss, increasing data dispersion, and the erosion of standardized accuracy.
Definition of Data Warehouse
Bill Inmon’s widely accepted definition describes a data warehouse as a subject‑oriented, integrated, non‑volatile, time‑variant collection of data that supports decision‑making.
Subject‑oriented : Data is organized around business subjects (e.g., users, discounts, sales) rather than transactional processes.
Integrated : Data from heterogeneous sources is extracted, cleaned, transformed, and loaded with unified coding and summarization to eliminate redundancy.
Stable (non‑volatile) : Once loaded, data is retained for historical analysis and rarely updated, serving long‑term decision support.
Time‑variant : Historical timestamps allow trend analysis and forecasting.
Data Warehouse Modeling Methods
Entity Modeling
Entity modeling, derived from philosophy, treats business as a collection of entities, events, and descriptions. Any business fact can be broken into these three parts, making the method useful for early‑stage domain modeling.
For example, in the statement “Xiao Ming drives to school,” “Xiao Ming” and “school” are entities, “going to school” is an event, and “driving” is a description.
Entity: a specific concept or object involved in business.
Event: a process that connects entities.
Description: additional details about entities or events.
Entity modeling excels in business and domain modeling but is limited to those phases; logical and physical modeling later rely on normalization and dimensional techniques.
Normalization (Third‑Normal‑Form) Modeling
Promoted by Inmon, this method follows relational database theory, ensuring each attribute is atomic, fully dependent on the whole primary key, and free of transitive dependencies. It aligns data‑warehouse models with the enterprise’s subject‑area and logical models.
Each attribute value is unique and unambiguous.
Non‑key attributes fully depend on the entire primary key.
Non‑key attributes do not depend on attributes of other tables.
While this approach integrates well with relational systems, it can restrict flexibility and performance when aggregating data for data‑mart layers.
Dimensional Modeling
Introduced by Kimball, dimensional modeling separates data into fact tables and dimension tables, forming star or snowflake schemas. It pre‑processes dimensions (e.g., time, region, product) to accelerate analytical queries.
Dimension tables describe attributes of business entities; fact tables store measurable events.
Star schema: each dimension directly links to the fact table.
Snowflake schema: dimensions are normalized, reducing redundancy but potentially adding joins.
Typical dimensional design steps: define the analysis subject, select dimensions, decide grain of fact rows, and choose measurable metrics.
Example: analyzing recruitment for each McDonald’s store using dimensions such as time, region, brand, store, position, and channel, with metrics like applications, hires, and rejections.
Physical Layer Hierarchy in Data Warehousing
Physical modeling often splits data into three layers: detailed fact (ODS‑level), light aggregation, and topic‑wide tables.
Detail layer: fine‑grained, cleaned data similar to ODS.
Aggregation layer: business‑oriented summaries.
Topic‑wide layer: integrated views built on aggregation data.
Dimension tables: sourced directly from business or derived from facts.
Maoyan’s Data Warehouse Development History
Stage 1 (2015)
Initial warehouse focused on rapid business needs, using isolated “chimney‑style” development for each requirement, leading to high operational cost, frequent schema changes, and data inconsistency.
Stage 2
After recognizing Stage 1’s drawbacks, Maoyan built an offline warehouse organized by business themes (e.g., seat selection, discount cards, coupons), improving data reuse and consistency across similar business lines.
Stage 3
Model‑driven development addressed naming conflicts, data silos, and unified source data by constructing unified order, UGC, payment, and finance models, standardizing data definitions and export interfaces.
Comparison of the Three Stages
Maoyan Technical Team https://juejin.im/post/5d5be159f265da03970bbc3a
Recommended Reading
Tencent Data Director Tutorial: Build a Big Data Knowledge System in 5 Steps
Dada O2O Backend Architecture Evolution: From 0 to 4000 Concurrent Requests
iQIYI’s Database Selection Guide
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.