Enterprise Data Warehouse Development Playbook: Standard Engineering Edition
This playbook provides enterprise‑level data warehouse engineers, ETL developers, data modelers, and data‑team managers with a complete, logical, and actionable set of standards, processes, and best‑practice guidelines covering architecture, development principles, role responsibilities, end‑to‑end workflow, metadata, security, performance metrics, and team collaboration.
Preface
This manual is written for enterprise data‑warehouse developers, ETL engineers, data‑modelling engineers, and data‑team managers. It sets out development standards and process guidelines that can be applied directly.
Core Concepts
Data Warehouse (DW): A subject‑oriented, integrated, non‑volatile (stable), time‑variant collection of data that supports management decisions.
Four‑layer architecture:
ODS – raw data layer, ingests data as‑is with light cleaning.
DWD – detailed data layer, performs cleaning, integration, standardisation, and deduplication.
DWS – aggregated data layer, provides light aggregation, common metrics, and wide tables.
ADS – application data layer, highly aggregated for reports, dashboards, and interfaces.
ETL: Extract‑Transform‑Load, the core workflow of data‑warehouse development.
Development Principles
Layered decoupling – each layer reads only from the layer immediately upstream of it (ODS → DWD → DWS → ADS); reverse dependencies are prohibited.
Single data source – each metric has a unique source to avoid duplicate calculations.
Model‑first development – define subject domains, dimensions, and facts before creating tables or writing SQL.
Traceability and rollback – retain full‑process logs, version control, and scripts that can be rolled back.
Quality left‑shift – embed quality controls throughout requirement, design, development, and testing phases rather than fixing issues later.
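The layered‑decoupling principle can be checked mechanically. The following is a minimal sketch, assuming that a table's layer can be read from its name prefix and that dependencies are available as a plain mapping; the function names and the example tables are hypothetical. It also assumes a strict reading of the rule (only the immediately upstream layer is allowed), which a team may choose to relax for same‑layer dimension joins.

```python
from typing import Dict, List, Tuple

# Layers in data-flow order: each layer reads from the one before it.
LAYER_ORDER = ["ods", "dwd", "dws", "ads"]


def layer_of(table_name: str) -> str:
    """Return the layer prefix of a table name such as 'dwd_trade_order_detail'."""
    prefix = table_name.split("_", 1)[0]
    if prefix not in LAYER_ORDER:
        raise ValueError(f"unknown layer prefix in {table_name!r}")
    return prefix


def find_layer_violations(deps: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (table, upstream) pairs that break the adjacent-layer rule.

    ODS tables are skipped: this sketch assumes their inputs are source
    systems rather than warehouse tables.
    """
    violations = []
    for table, upstreams in deps.items():
        table_idx = LAYER_ORDER.index(layer_of(table))
        if table_idx == 0:
            continue
        for upstream in upstreams:
            upstream_idx = LAYER_ORDER.index(layer_of(upstream))
            # Only the immediately upstream layer is accepted.
            if upstream_idx != table_idx - 1:
                violations.append((table, upstream))
    return violations
```

Run against a dependency map exported from the scheduler or the metadata system, a check like this flags both reverse dependencies (a DWD table reading from ADS) and layer skipping (an ADS table reading DWD directly).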
Roles and Responsibilities
Required Skills
Proficient in SQL (Hive/Spark) and data modelling (dimensional or relational).
Familiar with scheduling tools (Airflow/DolphinScheduler), metadata management (Atlas), and data‑quality frameworks (Great Expectations or custom).
Understand business processes, metric definitions, data lineage, and lifecycle management.
Ability to troubleshoot, optimise SQL, and tune cluster resources.
Core Responsibilities
Participate in requirement reviews, metric definition, and alignment of metric definitions across teams.
Design data‑warehouse layers, table structures, and models.
Write, test, and deploy ETL scripts, scheduling tasks, and quality rules.
Ensure data accuracy, consistency, completeness, and timeliness.
Handle data issue investigation, SQL optimisation, task operations, and documentation.
Collaborate on data governance, metadata, permission management, and security compliance.
End‑to‑End Process Implementation
Phase 1: Requirement Analysis & Metric Alignment (Critical)
Requirement collection – receive PRD, clarify purpose, time range, dimensions, metrics, granularity, and timeliness.
Metric inventory – list metric name, definition, calculation logic, source, period, and precision.
Metric alignment – conduct a four‑party review (business, product, data‑warehouse, reporting) and produce a Metric Specification Document containing unique IDs, names, definitions, and calculation logic.
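One way to keep the Metric Specification Document machine‑checkable is to hold each entry as a small record. The sketch below is illustrative only: the `MetricSpec` class, its field names, and the `find_duplicates` helper are assumptions that mirror the inventory fields listed above, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MetricSpec:
    """One row of the metric inventory (field names are illustrative)."""
    metric_id: str
    name: str
    definition: str
    calculation_logic: str
    source: str
    period: str
    precision: int  # number of decimal places


def find_duplicates(specs: List[MetricSpec]) -> Dict[str, List[str]]:
    """Return metric IDs and names that appear more than once."""
    id_counts = Counter(spec.metric_id for spec in specs)
    name_counts = Counter(spec.name for spec in specs)
    return {
        "metric_id": sorted(k for k, n in id_counts.items() if n > 1),
        "name": sorted(k for k, n in name_counts.items() if n > 1),
    }
```

A non‑empty result means two entries claim the same ID or the same name, which is the situation the four‑party review is meant to resolve before development starts.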
Phase 2: Data‑Warehouse Modelling & Layer Design
Domain segmentation – split by business areas (user, order, product, payment, logistics, marketing).
Dimensional modelling – star or snowflake schema; dimension tables are stable, low‑change, reusable.
Fact tables – record business events (order, payment, click), large, fast‑growing, partitioned by time.
Design principles – single‑purpose dimensions, clear facts, consistent granularity, controlled redundancy.
Layer‑specific designs:
ODS – mirror source tables; add a partition field (dt), a load‑time field (etl_time), and a source‑system field; naming: ods_{dbname}_{table}_{full/incr}.
DWD – one fact per wide table, integrate multiple ODS tables, denormalise dimensions; naming: dwd_{domain}_fact_{granularity}.
DWS – aggregate by common dimensions, compute shared metrics; naming: dws_{domain}_agg_{dim_combination}.
ADS – highly aggregated for BI/reporting, prioritise query performance; naming: ads_{report}_{module}_{period}.
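The naming conventions above lend themselves to an automated check at review time. This is a rough sketch under stated assumptions: every segment after the layer prefix is treated as a free‑form lowercase placeholder, and only the ODS suffix (full or incr) is enforced literally. The patterns and the `is_valid_table_name` helper are hypothetical and should be tightened to match the team's actual rules.

```python
import re

# One lowercase name segment, e.g. "trade" or "order2".
_SEGMENT = r"[a-z][a-z0-9]*"

# Assumption: at least two segments follow the layer prefix.
PATTERNS = {
    "ods": re.compile(rf"^ods_{_SEGMENT}(_{_SEGMENT})+_(full|incr)$"),
    "dwd": re.compile(rf"^dwd_{_SEGMENT}(_{_SEGMENT})+$"),
    "dws": re.compile(rf"^dws_{_SEGMENT}(_{_SEGMENT})+$"),
    "ads": re.compile(rf"^ads_{_SEGMENT}(_{_SEGMENT})+$"),
}


def is_valid_table_name(name: str) -> bool:
    """Check a table name against the pattern for its layer prefix."""
    layer = name.split("_", 1)[0]
    pattern = PATTERNS.get(layer)
    return bool(pattern and pattern.match(name))
```

For example, `ods_shop_orders_incr` passes, while `ods_shop_orders` (missing the load‑type suffix) and `tmp_orders` (unknown layer) are rejected.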
Phase 3: ETL Development & Script Writing
Development standards – enforce SQL style: keywords uppercase, table/column names lowercase with underscores, no SELECT *, explicit field lists, split complex logic into CTEs or sub‑queries, limit a single SQL to ≤200 lines.
Layer‑specific development:
ODS – INSERT OVERWRITE for full loads (partitioned by day) or INSERT INTO for incremental loads based on update_time or log offsets; mandatory deduplication, null filtering, encoding standardisation, and type alignment.
DWD – read ODS, filter invalid rows, join dimension tables, clean/transform (standardise, mask, replace outliers), deduplicate by unique key (e.g., order_id), write to partitioned tables; embed quality rules (non‑null, uniqueness, range, enum).
DWS – aggregate DWD by common dimensions (e.g., day + region), compute shared metrics to avoid downstream duplication.
ADS – aggregate for reports, optimise for query speed, optionally keep redundant (denormalised) data for performance; as a guideline, keep each table at ≤10 million rows (the practical limit depends on available resources).
Version control – store all scripts in Git/SVN, branch by project/module, commit format: [layer] table_name: change_description; mandatory code review by senior developers or architects before production deployment.
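Parts of the development standards and the commit format can be enforced before code review. The sketch below is a deliberately naive, regex‑based illustration: `lint_sql` and `is_valid_commit_message` are hypothetical helpers, the patterns do not understand SQL comments or string literals, and the commit pattern assumes the layer is one of ods, dwd, dws, or ads.

```python
import re
from typing import List

MAX_SQL_LINES = 200  # single-statement limit from the development standards

# Matches "SELECT *" and "SELECT t.*", but not "SELECT COUNT(*)".
SELECT_STAR_RE = re.compile(r"\bselect\s+(?:\w+\.)?\*", re.IGNORECASE)

# Commit format from the playbook: "[layer] table_name: change_description".
COMMIT_RE = re.compile(r"^\[(ods|dwd|dws|ads)\] [a-z][a-z0-9_]*: \S.*$")


def lint_sql(sql: str) -> List[str]:
    """Return a list of style problems found in one SQL statement."""
    problems = []
    if SELECT_STAR_RE.search(sql):
        problems.append("SELECT * is not allowed; list columns explicitly")
    line_count = len(sql.strip().splitlines())
    if line_count > MAX_SQL_LINES:
        problems.append(f"statement has {line_count} lines; limit is {MAX_SQL_LINES}")
    return problems


def is_valid_commit_message(message: str) -> bool:
    """Check the first line of a commit message against the agreed format."""
    lines = message.splitlines()
    return bool(lines) and COMMIT_RE.match(lines[0]) is not None
```

A check like this fits naturally into a pre‑commit hook or CI step; it complements, rather than replaces, the mandatory review by senior developers or architects.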
Phase 4: Data‑Quality Control (Across All Stages)
Four quality dimensions – completeness (non‑null rate, record count), accuracy (valid ranges, correct enums, correct calculations), consistency (cross‑table metric consistency, unified dimension codes, single metric definition), timeliness (on‑time delivery, acceptable latency).
Quality rule configuration – each rule is quantifiable and can trigger alerts. Examples: ODS – source‑ingest success = 100 %, null rate < 1 %, duplicate rate = 0; DWD – unique‑key duplicates = 0, key‑metric non‑null rate = 100 %, anomaly rate < 0.1 %; DWS/ADS – metric drift beyond ±30 % raises an alert, record‑count drift stays < 20 %.
Quality execution – embed filters in SQL during development, run quality‑check scripts in testing, generate a Data Quality Test Report, and enforce quality checks in scheduling; failures abort tasks and send alerts.
Operations – daily quality reports, response to issues within 24 hours, monthly quality review reports.
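The quantified rules in this phase reduce to a handful of ratios. The following is a minimal sketch of how they might be computed over a batch held in memory as a list of dictionaries; in practice the same logic would run as SQL or inside a data‑quality framework. The function names and the in‑memory row format are assumptions made for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def null_rate(rows: List[Row], column: str) -> float:
    """Share of rows whose value in `column` is missing (None)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def duplicate_rate(rows: List[Row], key: str) -> float:
    """Share of rows that repeat an already-seen value of `key`."""
    if not rows:
        return 0.0
    distinct = len({row[key] for row in rows})
    return (len(rows) - distinct) / len(rows)


def relative_drift(current: float, baseline: float) -> float:
    """Signed relative change of `current` against `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def check_dwd(rows: List[Row], key: str, metric: str) -> List[str]:
    """Apply the DWD example rules: no duplicate keys, no null key metric."""
    failures = []
    if duplicate_rate(rows, key) != 0:
        failures.append("unique-key duplicates must be 0")
    if null_rate(rows, metric) != 0:
        failures.append("key metric must be 100 % non-null")
    return failures
```

A scheduler task can abort and alert whenever `check_dwd` returns a non‑empty list, or whenever `abs(relative_drift(...))` exceeds the configured threshold for a DWS/ADS metric.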
Phase 5: Task Scheduling & Dependency Management
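Scheduling tools – Airflow or DolphinScheduler for orchestration; DataX is a data‑synchronisation tool rather than a scheduler and runs as a task under one of them.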
Frequency – daily (most common), hourly, minute‑level, or real‑time.
Strict layer dependency – ODS → DWD → DWS → ADS.
Retry policy – automatic 2‑3 retries with 5‑10 minute intervals.
Alerting – failure, timeout, or quality‑check failure triggers email and enterprise‑messenger alerts.
Configuration – task naming: {layer}_{table}_{period}; timeout settings (ODS 2 h, DWD 4 h, DWS 2 h, ADS 1 h); priority levels (core high, non‑core medium, temporary low).
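Schedulers such as Airflow and DolphinScheduler provide retries and timeouts as task settings, so the sketch below is not how either tool implements them. It is a scheduler‑independent illustration of the retry policy and timeout table described above; `run_with_retries` and `LAYER_TIMEOUT_HOURS` are hypothetical names, and the sleep function is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Per-layer timeouts from the configuration guidance, in hours.
LAYER_TIMEOUT_HOURS = {"ods": 2, "dwd": 4, "dws": 2, "ads": 1}


def run_with_retries(
    task: Callable[[], T],
    max_retries: int = 3,
    interval_seconds: float = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`; on failure retry up to `max_retries` times.

    Waits `interval_seconds` between attempts and re-raises the last
    error once the retries are exhausted.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= max_retries:
                raise
            attempt += 1
            sleep(interval_seconds)
```

With the defaults (3 retries, 300 seconds apart) a task is attempted at most four times; the final failure propagates so that the alerting step can fire.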
Phase 6: Testing, Release, and Change Management
Testing – unit tests (script success, expected results, quality rules), integration tests (full‑chain data flow, lineage, no loss), UAT (business validates metric definitions, accuracy, and report display).
Release – after test pass, submit release request, obtain approval, deploy scripts and schedule tasks to production, perform first‑run manual verification, then update documentation, metadata, and lineage.
Change management – any table/field/metric change requires a Change Request with reason, impact, and rollback plan; after review, validate in test, execute in production, update documentation, metadata, and lineage.
Metadata Management & Data Lineage
Technical metadata – table name, fields, types, partitions, storage path, creation time, owner.
Business metadata – metric definition, granularity, meaning, associated dimensions, usage scenarios.
Operational metadata – task run logs, load timestamps, update cycles, change history.
Registration – all tables, fields, and metrics must be recorded in a metadata system (e.g., Atlas) and synchronised within 24 hours of any change.
Lineage – full end‑to‑end lineage from source systems through ODS, DWD, DWS, and ADS to reports/interfaces must be visualised.
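End‑to‑end lineage is, at its core, a directed graph that can be walked upstream from any report or table. The sketch below assumes lineage has been exported as a mapping from each node to its direct parents; it does not use the Atlas API, and the node names are invented for illustration.

```python
from collections import deque
from typing import Dict, List


def upstream_of(node: str, parents: Dict[str, List[str]]) -> List[str]:
    """Return every transitive upstream node of `node`, nearest first."""
    seen = set()
    ordered = []
    queue = deque(parents.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        ordered.append(current)
        # Breadth-first: enqueue this node's own parents for a later pass.
        queue.extend(parents.get(current, []))
    return ordered


# Hypothetical lineage from a report back to two source tables.
LINEAGE = {
    "report_sales": ["ads_sales_overview_daily"],
    "ads_sales_overview_daily": ["dws_trade_agg_day_region"],
    "dws_trade_agg_day_region": ["dwd_trade_order_detail"],
    "dwd_trade_order_detail": ["ods_shop_orders_incr", "ods_shop_users_full"],
    "ods_shop_orders_incr": ["src.shop.orders"],
    "ods_shop_users_full": ["src.shop.users"],
}
```

Calling `upstream_of("report_sales", LINEAGE)` lists everything a change request must consider when one of those upstream tables is altered, which is the impact analysis the change‑management step relies on.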
Security & Permission Management
Data Security
Sensitive data masking – mask phone numbers, ID cards, bank cards before ingestion (e.g., 138****1234).
Data classification – public, internal, confidential with corresponding access controls.
Access audit – all data access, operations, and exports are fully logged for traceability.
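The masking rule can be illustrated with a short sketch that reproduces the 138****1234 example. It assumes 11‑digit mobile numbers and keeps the last four characters of other identifiers; both choices, and the function names, are assumptions for illustration rather than a mandated masking policy.

```python
import re


def mask_phone(phone: str) -> str:
    """Mask an 11-digit mobile number, keeping the first 3 and last 4 digits."""
    digits = re.sub(r"\D", "", phone)  # drop spaces, dashes, and other separators
    if len(digits) != 11:
        raise ValueError("expected an 11-digit phone number")
    return digits[:3] + "****" + digits[-4:]


def mask_keep_tail(value: str, visible: int = 4) -> str:
    """Mask everything except the last `visible` characters (e.g. card numbers)."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

For example, `mask_phone("13812341234")` returns `138****1234`. Applying such functions in the ingestion job, before data lands in ODS, keeps unmasked values out of every downstream layer.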
Permission Control (Least‑Privilege)
Development environment – read/write for data‑warehouse developers, read‑only for testers.
Production environment – developers have read/write only on tables within their domain; no drop/alter permissions.
Operations – scheduling and monitoring permissions only, no data read/write.
Business/Reporting – read‑only on ADS layer, no access to ODS/DWD raw data.
Permission approval workflow – request → owner approval → security review → grant; changes follow the same process.
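The production rules above can be written down as a small decision function, which is useful for reviewing grant requests consistently. This is a sketch only: the role and action names are hypothetical labels, and real enforcement belongs in the platform's own authorisation layer.

```python
def is_allowed_in_production(
    role: str, action: str, layer: str, same_domain: bool = False
) -> bool:
    """Decide whether `role` may perform `action` on a production table.

    `layer` is the table's layer (ods/dwd/dws/ads); `same_domain` says
    whether the table belongs to the requesting developer's own domain.
    """
    if role == "developer":
        # Read/write inside the developer's own domain; never drop or alter.
        return action in {"read", "write"} and same_domain
    if role == "operations":
        # Scheduling and monitoring only, no data access.
        return action in {"schedule", "monitor"}
    if role == "business":
        # Business and reporting users read the ADS layer only.
        return action == "read" and layer == "ads"
    return False  # least privilege: unknown roles get nothing
```

Because the function defaults to `False`, any role or action that is not explicitly listed is denied, which matches the least‑privilege intent of this section.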
Performance Evaluation & Team Collaboration
Quantitative Metrics
Efficiency – average requirement delivery ≤5 days, script development time, task success rate ≥99.5 %.
Quality – data‑quality issue rate ≤0.5 %, metric accuracy 100 %, metric definition consistency 100 %.
Operations – mean time to recovery ≤2 h, alert‑response rate ≥95 %, CPU utilisation 60‑80 %.
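Two of these targets, task success rate and mean time to recovery, are simple to compute from scheduler logs. The sketch below assumes the logs have already been reduced to counts and to (start, end) incident pairs; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def task_success_rate(succeeded: int, total: int) -> float:
    """Fraction of task runs that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return succeeded / total


def mean_time_to_recovery(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average duration between incident start and recovery."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)
```

The results can be compared directly with the targets above, for example `task_success_rate(...) >= 0.995` and `mean_time_to_recovery(...) <= timedelta(hours=2)`.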
Team Collaboration
Cross‑role coordination – business provides accurate requirements and validates metrics; product defines priorities and metric definitions; data‑warehouse engineers handle modelling, development, testing, release, and operations; reporting/BI builds dashboards and gathers feedback.
Communication – daily stand‑ups for progress, risk, and issues; weekly reviews for requirements, design, quality, and operations; shared documentation repository for all design, metric, test, and operations docs.
Appendix
This manual is a universal standard for enterprise‑level data‑warehouse development; individual business lines may add specifics but must not lower core processes or quality requirements.
The data‑architecture team is responsible for interpretation, updates, and maintenance, with at least an annual review.
All data‑warehouse personnel and related roles must strictly adhere to this manual; adherence will be incorporated into performance assessments.
Big Data Tech Team
Focuses on big data, data analysis, data warehousing, data middle platform, data science, Flink, AI and interview experience, side‑hustle earning and career planning.
