Fundamentals 27 min read

How to Build a Scalable Data Warehouse for the New WOS System

This article outlines the end‑to‑end process of designing, building, and governing a data‑warehouse model for the new commercial WOS system, covering business research, data‑domain division, multi‑layer architecture, modeling methods, practical case studies, governance challenges, and improvement strategies.

Weimob Technology Center

Aug 25, 2023

How to Build a Scalable Data Warehouse for the New WOS System

1. Introduction

To solve data duplication, inconsistent metrics, and other data issues across business lines, a new data‑warehouse model is needed for the underlying data architecture of the new commercial WOS operating system, supporting reporting, APIs, and business‑system data applications.

2. Construction Ideas

2.1 Business Research

During WOS system reconstruction, Weimeng business moved from rapid expansion to a stable, mature phase. The business process is divided into pre‑commercialization (before a customer becomes a merchant) and post‑commercialization (after becoming a merchant). Each business step’s possible data and corresponding systems are clarified to facilitate warehouse construction.

2.2 Data Domain Division

A data domain groups closely related data topics, providing a high‑level conceptual classification for easier data management and application.

Division principle: divide by data type, e.g., order data belongs to the order domain, not the guide domain.

Division scope: data‑domain division applies to DWD, DWS, DIM layers.

Naming rules: two layers – data domain and data sub‑domain, with sub‑domain naming prioritized. Naming can follow business processes, entity objects, or business names.

After naming sub‑domains, a suitable overall name can be given to the data domain.

2.3 Data‑Warehouse Layered Architecture

The layered architecture includes ODS, DWD, DWS, DIM, ADS, and OUTPUT layers, each with specific functions:

ODS (source layer): stores OLTP data with minimal processing, focusing on source performance and storage cost.

DWD (detail layer): uses dimensional modeling, degrades dimensions to fact tables, builds fine‑granularity fact tables, and improves data usability.

DWS (summary layer): builds public‑granularity summary fact tables based on ADS requirements, ensuring consistent metrics.

DIM (dimension layer): establishes consistent enterprise dimensions to reduce calculation inconsistencies.

ADS (datamart layer): organizes data by business perspective for fast response, avoiding metric calculations.

OUTPUT (export layer): pushes ADS‑processed results to downstream databases, focusing on performance and stability, prohibiting metric calculations.

2.4 Model Construction

Four Common Modeling Methods

Dimensional modeling: builds models from analysis needs, emphasizing fast query performance with data redundancy.

Schema modeling (ER model): 3NF design from enterprise perspective, requiring high modeling skill.

Data Vault: composed of Hub, Link, Satellite, focusing on data integration.

Anchor model: highly extensible, normalizes to 6NF, resembling key‑value structures.

Considering internet industry characteristics, dimensional modeling is chosen for rapid business response.

Dimensional Modeling Steps

1) Choose business process → 2) Declare granularity → 3) Confirm dimensions → 4) Confirm facts.

Business process: operational activities like order placement, after‑sale handling, marketing creation. Ensure each process’s core data is extracted once.

Granularity declaration: define row level in fact tables; start with atomic granularity to support detailed analysis.

Dimensions: background attributes (who, where, when) that serve as the “soul” of the warehouse.

Facts: measurable values of business events, aligned with declared granularity.

Model Design Principles

Key principles include selecting facts related to business processes, decomposing non‑additive facts, declaring granularity before choosing dimensions/facts, avoiding mixed granularity in a fact table, keeping fact units consistent, and standardizing null, abnormal, and discrete values.

Practical Results

3.1 Warehouse Layer Results

The layered architecture (ODS, DWD, DWS, DIM, ADS, OUTPUT) remains unchanged across multiple development groups, with a public layer added for DWD, DWS, DIM to host common data and logic.

Dependencies are defined, e.g., DWD (public) can depend on ODS; DWD (business) can depend on ODS and public DWD; DWS (public) can depend on both public and business DWD, etc.

3.2 Data‑Domain Division Results

Thirteen domains were defined: product, sales, finance, transaction, log, merchant, product, asset, role, activity, service, configuration, efficiency. Each domain contains sub‑domains; merchant domain is separate due to its importance.

3.3 Model Design Cases

Case A – DWD Design

Requirement: daily order amount metrics for internal and imported orders. Solution: separate DWDs for internal (incremental load) and external (full load) orders, handling historical timestamps and data volume differences.

Case B – DWS Design

Requirement: yearly unique visitor count for merchant stores. Solution: add a lightweight DWS aggregation layer before the full DWS to reduce data volume by ~70%.

Case C – DIM Design

Requirement: daily/weekly/monthly on‑sale product counts. Solution: use a slowly changing dimension (link table) with an open‑ended start date and a far future end date, storing only changes to reduce storage and computation.

3.4 Model Layer Distribution Results

Overall BI team model proportion: DWD : DWS : DIM : ADS ≈ 1.5 : 2 : 1 : 5. DWS proportion is slightly high, indicating room for improvement.

4. Model Governance

4.1 Practical Issues

Non‑standard dependencies (e.g., OUTPUT directly depends on ODS).

Duplicate model construction across teams.

Over‑design due to unwillingness to reuse existing models.

Inconsistent table naming.

Lack of table descriptions.

ADS logic not centralized in public layers.

4.2 Governance Solutions

Three phases:

Pre‑stage

Standard promotion: disseminate data‑development process standards (domain division, design principles, layer principles, dependency rules).

Product tools: embed standards (planning, dimensional modeling, metrics, data standards) into systems; enforce model layer selection and metadata during table creation.

During Stage

Process standardization: demand review, design, development, self‑test, test, release, operation.

Demand review: clarify background, purpose, value, priority, module, metric definitions.

Design: produce requirement analysis, layer flow diagram, logical and physical model documentation.

Development: use the data platform for rapid model building and ETL coding, with code testing tools.

Self‑test: check data completeness, uniqueness, consistency, accuracy, and validity.

Testing: conduct test case review and execution.

Release: prepare release and rollback plans, ensure compliance and safety.

Operation: monitor model quality, ETL timeliness, and conduct post‑release governance.

Post‑stage

Model coverage metric: (total DWS/ADS/OUTPUT tasks – tasks directly referencing ODS) / total tasks × 100%.

Model reuse metric: total downstream reuse count of DWD/DWS/DIM / number of DWD/DWS/DIM models.

4.3 Governance Effect

Model coverage improved from 93.37% to 97.82%.

5. Conclusion

Model coverage is now managed centrally with positive progress; model reuse governance will be addressed next. Future improvements include enhancing the data‑asset dashboard for easier model discovery, enforcing approval for new models to reduce duplication, and scanning code for redundant segments.

6. References

1. Data Warehouse Toolbox (3rd Edition)

Recommended readings: “Weimeng Real‑Time Link Building User Data Platform”, “Weimeng Business Search Platform Design and Implementation”, “Data Asset Governance Practices”, “Enterprise Data Security Protection: Encryption and Masking”, “Storytelling with Data – Visualization in Data Products”.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

modeling ETL Governance

Written by

Weimob Technology Center

Official platform of the Weimob Technology Center

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.