How to Build a Scalable Data Warehouse for the New WOS System
This article outlines the end‑to‑end process of designing, building, and governing a data‑warehouse model for the new commercial WOS system, covering business research, data‑domain division, multi‑layer architecture, modeling methods, practical case studies, governance challenges, and improvement strategies.
1. Introduction
To solve data duplication, inconsistent metrics, and other data issues across business lines, a new data‑warehouse model is needed for the underlying data architecture of the new commercial WOS operating system, supporting reporting, APIs, and business‑system data applications.
2. Construction Ideas
2.1 Business Research
During WOS system reconstruction, Weimeng business moved from rapid expansion to a stable, mature phase. The business process is divided into pre‑commercialization (before a customer becomes a merchant) and post‑commercialization (after becoming a merchant). Each business step’s possible data and corresponding systems are clarified to facilitate warehouse construction.
2.2 Data Domain Division
A data domain groups closely related data topics, providing a high‑level conceptual classification for easier data management and application.
Division principle: divide by data type, e.g., order data belongs to the order domain, not the guide domain.
Division scope: data‑domain division applies to DWD, DWS, DIM layers.
Naming rules: two layers – data domain and data sub‑domain, with sub‑domain naming prioritized. Naming can follow business processes, entity objects, or business names.
After naming sub‑domains, a suitable overall name can be given to the data domain.
2.3 Data‑Warehouse Layered Architecture
The layered architecture includes ODS, DWD, DWS, DIM, ADS, and OUTPUT layers, each with specific functions:
ODS (source layer): stores OLTP data with minimal processing, focusing on source performance and storage cost.
DWD (detail layer): uses dimensional modeling, degrades dimensions to fact tables, builds fine‑granularity fact tables, and improves data usability.
DWS (summary layer): builds public‑granularity summary fact tables based on ADS requirements, ensuring consistent metrics.
DIM (dimension layer): establishes consistent enterprise dimensions to reduce calculation inconsistencies.
ADS (datamart layer): organizes data by business perspective for fast response, avoiding metric calculations.
OUTPUT (export layer): pushes ADS‑processed results to downstream databases, focusing on performance and stability, prohibiting metric calculations.
2.4 Model Construction
Four Common Modeling Methods
Dimensional modeling: builds models from analysis needs, emphasizing fast query performance with data redundancy.
Schema modeling (ER model): 3NF design from enterprise perspective, requiring high modeling skill.
Data Vault: composed of Hub, Link, Satellite, focusing on data integration.
Anchor model: highly extensible, normalizes to 6NF, resembling key‑value structures.
Considering internet industry characteristics, dimensional modeling is chosen for rapid business response.
Dimensional Modeling Steps
1) Choose business process → 2) Declare granularity → 3) Confirm dimensions → 4) Confirm facts.
Business process: operational activities like order placement, after‑sale handling, marketing creation. Ensure each process’s core data is extracted once.
Granularity declaration: define row level in fact tables; start with atomic granularity to support detailed analysis.
Dimensions: background attributes (who, where, when) that serve as the “soul” of the warehouse.
Facts: measurable values of business events, aligned with declared granularity.
Model Design Principles
Key principles include selecting facts related to business processes, decomposing non‑additive facts, declaring granularity before choosing dimensions/facts, avoiding mixed granularity in a fact table, keeping fact units consistent, and standardizing null, abnormal, and discrete values.
Practical Results
3.1 Warehouse Layer Results
The layered architecture (ODS, DWD, DWS, DIM, ADS, OUTPUT) remains unchanged across multiple development groups, with a public layer added for DWD, DWS, DIM to host common data and logic.
Dependencies are defined, e.g., DWD (public) can depend on ODS; DWD (business) can depend on ODS and public DWD; DWS (public) can depend on both public and business DWD, etc.
3.2 Data‑Domain Division Results
Thirteen domains were defined: product, sales, finance, transaction, log, merchant, product, asset, role, activity, service, configuration, efficiency. Each domain contains sub‑domains; merchant domain is separate due to its importance.
3.3 Model Design Cases
Case A – DWD Design
Requirement: daily order amount metrics for internal and imported orders. Solution: separate DWDs for internal (incremental load) and external (full load) orders, handling historical timestamps and data volume differences.
Case B – DWS Design
Requirement: yearly unique visitor count for merchant stores. Solution: add a lightweight DWS aggregation layer before the full DWS to reduce data volume by ~70%.
Case C – DIM Design
Requirement: daily/weekly/monthly on‑sale product counts. Solution: use a slowly changing dimension (link table) with an open‑ended start date and a far future end date, storing only changes to reduce storage and computation.
3.4 Model Layer Distribution Results
Overall BI team model proportion: DWD : DWS : DIM : ADS ≈ 1.5 : 2 : 1 : 5. DWS proportion is slightly high, indicating room for improvement.
4. Model Governance
4.1 Practical Issues
Non‑standard dependencies (e.g., OUTPUT directly depends on ODS).
Duplicate model construction across teams.
Over‑design due to unwillingness to reuse existing models.
Inconsistent table naming.
Lack of table descriptions.
ADS logic not centralized in public layers.
4.2 Governance Solutions
Three phases:
Pre‑stage
Standard promotion: disseminate data‑development process standards (domain division, design principles, layer principles, dependency rules).
Product tools: embed standards (planning, dimensional modeling, metrics, data standards) into systems; enforce model layer selection and metadata during table creation.
During Stage
Process standardization: demand review, design, development, self‑test, test, release, operation.
Demand review: clarify background, purpose, value, priority, module, metric definitions.
Design: produce requirement analysis, layer flow diagram, logical and physical model documentation.
Development: use the data platform for rapid model building and ETL coding, with code testing tools.
Self‑test: check data completeness, uniqueness, consistency, accuracy, and validity.
Testing: conduct test case review and execution.
Release: prepare release and rollback plans, ensure compliance and safety.
Operation: monitor model quality, ETL timeliness, and conduct post‑release governance.
Post‑stage
Model coverage metric: (total DWS/ADS/OUTPUT tasks – tasks directly referencing ODS) / total tasks × 100%.
Model reuse metric: total downstream reuse count of DWD/DWS/DIM / number of DWD/DWS/DIM models.
4.3 Governance Effect
Model coverage improved from 93.37% to 97.82%.
5. Conclusion
Model coverage is now managed centrally with positive progress; model reuse governance will be addressed next. Future improvements include enhancing the data‑asset dashboard for easier model discovery, enforcing approval for new models to reduce duplication, and scanning code for redundant segments.
6. References
1. Data Warehouse Toolbox (3rd Edition)
Recommended readings: “Weimeng Real‑Time Link Building User Data Platform”, “Weimeng Business Search Platform Design and Implementation”, “Data Asset Governance Practices”, “Enterprise Data Security Protection: Encryption and Masking”, “Storytelling with Data – Visualization in Data Products”.
Weimob Technology Center
Official platform of the Weimob Technology Center
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.