Big Data 15 min read

Mastering Data Warehouse Design: From Fact Tables to Dimensional Modeling

This article explains the core components of a data warehouse ecosystem, distinguishes fact and dimension tables, outlines synchronization strategies, introduces star, snowflake, and constellation schemas, and details the layered architecture from ODS to data marts for effective big‑data analytics.

Architect's Guide
Architect's Guide
Architect's Guide
Mastering Data Warehouse Design: From Fact Tables to Dimensional Modeling

1. Technical Overview

FLINK (1.12.1) – Distributed computing engine

HIVE (3.1.2) – Popular HQL data‑warehouse tool

PHOENIX (5.0.0) – HBase SQL query engine

SPARK (3.0.1) – Distributed computing engine

SQOOP (1.4.7) – Data ingestion and export service

TEZ (0.10.0) – Optimizes MapReduce DAG

YARN (3.1.1) – Distributed resource scheduler

HDFS (3.1.1) – Distributed storage service

ZOOKEEPER (3.4.13) – Distributed coordination service

ALERTMANAGER (0.21.0) – Sends monitoring alerts

GRAFANA (6.5.1) – Visualizes monitoring data

INFLUXDB (1.8.0) – Stores monitoring data

NODEEXPORTER (1.0.0) – Reads node resource metrics

PROMETHEUS (2.18.1) – Pulls monitoring data

USDPMONITOR (1.0.0) – Reads component monitoring metrics

HUE (4.8.0) – Visual management service

ZEPPELIN (0.9.0) – Visual management service

DOLPHINSCHEDULER (1.3.6) – Task scheduling service

2. Types of Tables in a Data Warehouse

2.1 Fact Tables

Fact tables are divided into transactional fact tables (immutable records such as transaction logs) and periodic fact tables (records that change over business cycles, e.g., order status). They contain only primary‑key values that reference dimension tables, store numeric measures, and grow rapidly.

2.2 Dimension Tables

Dimension tables describe business entities (e.g., regions, product categories) and can be regular or static. Each has a single primary key used as a foreign key in fact tables and is typically wide, flat, and non‑normalized with many low‑granularity text attributes.

2.3 Summary of Table Concepts

In a data warehouse, strict normalization is unnecessary because the primary goal is analytical querying, not frequent updates. Fact tables aim to correctly record historical information, while dimension tables are designed to enable flexible aggregation from appropriate perspectives.

2.4 Table Synchronization Strategies

2.4.1 Dimension Tables

Dimension data changes over time; choose full load or slowly changing dimension (SCD) approaches based on business needs. For slowly changing dimensions, use SCD (link) tables or history tables to preserve past values.

2.4.2 Fact Table Synchronization

Transactional Fact Tables

Only daily incremental loads are needed because data never changes after insertion.

Periodic Fact Tables

Daily full load – impractical due to data volume and redundancy.

Daily incremental – cannot reflect data changes for historical snapshots.

Link (SCD) table – combine daily inserts and changes to enable time‑slice snapshots; requires capturing daily new and changed rows.

2.4.3 Design of Link (SCD) Tables

Store only changed rows, avoiding redundant data.

Enable retrieval of historical snapshots.

Add two columns (dw_start_date, dw_end_date) to indicate row validity; dw_end_date defaults to 9999‑12‑31.

When data changes, set the old row's dw_end_date to the change time and insert a new row with dw_start_date set to that time.

Typical workflow: initial full load, then incremental loads into ODS, followed by merging into historical tables.

3. Dimensional Model Design

3.1 Basic Concepts of Dimensional Modeling

Dimensional modeling starts from analytical decision‑making needs, building models that serve analysis requirements and provide good performance for large‑scale complex queries. It is a method specifically for analytical databases, data warehouses, and data marts.

3.2 Three Modeling Patterns

3.2.1 Star Schema

The most common approach: a central fact table surrounded by directly linked dimension tables, forming a star shape. Characteristics:

Dimension tables only connect to the fact table; no inter‑dimension relationships.

Each dimension table’s primary key is a foreign key in the fact table.

Typical star schema diagram shown below.

3.2.2 Snowflake Schema

An extension of the star schema that normalizes dimension tables into multiple related tables. Although more normalized, it is harder to understand, has higher maintenance cost, and lower query performance, so it is less commonly used.

3.2.3 Constellation Schema

Derived from the star schema but based on multiple fact tables sharing dimension tables. In many real‑world scenarios, several fact tables exist within the same dimensional space, and a dimension table may be reused by multiple facts, making the constellation model prevalent in later stages of development.

4. Data Warehouse Theory

4.1 Why Layer the Data?

Clear data structure – each layer has a specific scope, simplifying usage and understanding.

Data lineage tracking – enables quick pinpointing of source issues.

Reduce duplicate development – standardized layers allow reusable intermediate data.

Simplify complex problems – each layer handles a single step, making maintenance easier.

4.2 Data Layers

4.2.1 Operational Data Store (ODS)

Raw data layer that stores source system data after ETL extraction, preserving original granularity and structure, typically partitioned by business domain.

Modeling Approach and Principles

Incremental extraction from source systems, retain timestamps as needed, partition tables periodically, keep data unchanged from source, and organize by business themes.

4.2.2 Detailed Data Layer (DWD)

Cleaned version of ODS data (removing nulls, dirty data, anomalies) with the same structure and granularity, providing detailed historical data for downstream layers.

Modeling Approach and Principles

Add business date fields to support re‑processing, partition by year/month/day, and merge incremental ODS data with previous DWD tables.

4.2.3 Data Service Layer (DWS)

Aggregates and models data from DWD by business themes (e.g., tourism consumption), involving dimensions, facts, metrics, and granularity.

Modeling Approach and Principles

Aggregate and derive new facts.

Join facts from other themes; cross‑domain possible.

May use denormalized designs for performance.

4.2.4 Data Application Layer (ADS)

Provides data products for reporting, dashboards, OLAP, and analytics, typically stored in ES, MySQL, Hive, or Druid.

Modeling Approach and Principles

Keep data volume small.

Use star‑schema dimensional modeling.

Include dimension surrogate keys and measures.

Add business date fields for re‑runs.

Store without partitioning.

4.2.5 Data Mart Layer (DM)

Intermediate layer between DWS and ADS, building theme‑based models that can serve both analysis and ADS, essentially a small data warehouse.

Modeling Approach and Principles

Minimize computation at access time, optimize retrieval.

Use star‑schema dimensional modeling.

Widen facts, pre‑compute measures.

Partition storage.

big datadata warehouseETLdimensional modelingfact table
Architect's Guide
Written by

Architect's Guide

Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.