Design and Architecture of a Full‑Chain Data Warehouse for Information Security
The article presents a comprehensive design of an end‑to‑end data warehouse for information‑security governance, detailing background motivations, multi‑layer data architecture, dimension modeling, bus‑matrix mapping, real‑time (lambda/kappa) processing, data‑dictionary integration, and future directions toward unified streaming‑batch solutions.
Background – In the information‑security business, massive heterogeneous data (features, policies, user behavior) must be analyzed and validated. This requires a "full‑link" data warehouse that integrates data from every business line into a dense, highly integrated data network, turning data into proactive security production capacity.
Data Layering – The warehouse is divided into six layers:
1. Raw Data Layer (RAW) – Snapshot of source‑system data, stored daily with full detail.
2. Basic Data Layer (ODS) – Data organized around business concepts, with standardized names and codes.
3. General Data Layer (DWD) – Fine‑grained layer with light aggregation, built on star or snowflake models; metrics and dimensions are standardized.
4. Aggregated Data Layer (DWS) – Data marts for specific business needs, designed with star or snowflake schemas.
5. Dimension Layer (DIM) – Dimension tables providing rich attributes, historical traceability, and consistency across common dimensions.
6. Temporary Layer (TMP) – Transient tables that reduce computation complexity and improve runtime efficiency.
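The layering above can be sketched in code. The snippet below is a minimal, hypothetical illustration of a layered naming convention and a RAW‑to‑ODS standardization step; the layer prefixes, domain names, and field names are assumptions for illustration, not taken from the article.

```python
# Hypothetical sketch of the six-layer convention and one RAW -> ODS step.
# Layer prefixes, domains, and field names are illustrative assumptions.

LAYERS = ["raw", "ods", "dwd", "dws", "dim", "tmp"]

def table_name(layer: str, domain: str, entity: str) -> str:
    """Build a layered table name, e.g. ods_security_user_login."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}_{domain}_{entity}"

def raw_to_ods(raw_row: dict) -> dict:
    """Standardize names and codes when promoting a RAW snapshot row to ODS."""
    return {
        "user_id": str(raw_row["uid"]),        # unify the id type
        "event_code": raw_row["evt"].upper(),  # standardized event code
        "event_date": raw_row["dt"],           # daily partition key, unchanged
    }

print(table_name("ods", "security", "user_login"))
print(raw_to_ods({"uid": 42, "evt": "login", "dt": "2024-05-01"}))
```

The point of the sketch is that each layer only ever reads from the layer below it and applies one well‑defined kind of transformation.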
Dimension Modeling – Two mainstream approaches (normalized vs. dimensional) are compared. Normalized warehouses require heavy upfront work but yield stable long‑term maintenance; dimensional modeling is more agile, suits frequently changing business, and demands less expertise. Four key steps are outlined: selecting business processes, declaring grain, identifying dimensions, and confirming facts.
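The four modeling steps can be captured as a small design record. This is a hedged sketch: the `FactTableDesign` class and the login‑risk example are hypothetical names invented here to illustrate the sequence, not part of the article's system.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four dimensional-modeling steps as a design
# record. The process, grain, dimension, and fact names are illustrative.

@dataclass
class FactTableDesign:
    business_process: str                             # step 1: select the business process
    grain: str                                        # step 2: declare the grain
    dimensions: list = field(default_factory=list)    # step 3: identify dimensions
    facts: list = field(default_factory=list)         # step 4: confirm facts

login_risk = FactTableDesign(
    business_process="user login risk check",
    grain="one row per login attempt",
    dimensions=["user", "device", "time", "geo"],
    facts=["risk_score", "is_blocked"],
)
print(login_risk.grain)
```

Declaring the grain before choosing dimensions and facts is what keeps the resulting fact table unambiguous.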
Bus Matrix – The bus matrix acts as a map of the warehouse, linking each business process (rows) with common dimensions (columns). It provides a macro view of which processes share which dimensions, enabling quick alignment of data requirements with warehouse structures.
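A bus matrix reduces naturally to a mapping from process to conformed dimensions. The sketch below assumes three hypothetical security processes and shows how the matrix answers the alignment question directly; the process and dimension names are invented for illustration.

```python
# Hypothetical bus-matrix sketch: rows are business processes, columns are
# conformed dimensions; membership in the set marks the matrix cell.
BUS_MATRIX = {
    "login_risk":     {"user", "device", "time"},
    "content_review": {"user", "content", "time"},
    "traffic_audit":  {"device", "time"},
}

def shared_dimensions(*processes: str) -> set:
    """Dimensions conformed across all of the given processes."""
    return set.intersection(*(BUS_MATRIX[p] for p in processes))

print(sorted(shared_dimensions("login_risk", "content_review")))
```

Any dimension returned by `shared_dimensions` must be modeled once, in the DIM layer, so that every process reads the same attribute set.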
Overall Architecture – The warehouse is split into three logical parts:
General warehouse: stores cross‑business capability data (e.g., hunter‑risk system, cloud authentication).
Business warehouse: built for specific industry‑level analyses.
Subject warehouse: unified, cross‑business subject areas (traffic, content, user, etc.) based on consistent dimensions.
This three‑tier design follows an IKEA‑style analogy: a shared public floor (the general warehouse) serves developers, while dedicated floors (the business warehouses) serve analysts.
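The three‑way split can be expressed as a routing rule. The function below is a simplified sketch of the division described above, with an assumed precedence (subject areas first, then cross‑business capability data, then business‑specific data); the article does not specify actual routing logic.

```python
from typing import Optional

# Hypothetical routing sketch for the three-part architecture; the precedence
# order is an assumption made for illustration.
def route_dataset(cross_business: bool, subject_area: Optional[str]) -> str:
    if subject_area is not None:
        return "subject warehouse"    # unified cross-business subject areas
    if cross_business:
        return "general warehouse"    # shared capability data
    return "business warehouse"       # industry-specific analyses

print(route_dataset(cross_business=True, subject_area=None))
```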
Real‑Time Evolution – Discusses Lambda (batch + stream) and Kappa (stream‑only) architectures. Lambda offers flexibility but incurs the maintenance cost of two engines and risks inconsistency between batch and stream results; Kappa simplifies the stack by replaying a message queue (e.g., Kafka) through a single stream engine such as Flink, which also enables stream‑to‑Hive writes and automatic small‑file compaction.
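The core Kappa idea is that reprocessing is just a replay of the same log through the same job. The toy sketch below makes that concrete: an in‑memory list stands in for a Kafka topic, and a plain function stands in for the Flink job; everything here is an illustrative assumption, not the article's actual pipeline.

```python
from collections import Counter

# Hypothetical Kappa-style sketch: one stream job serves both live serving
# and backfill by replaying the same log from an earlier offset.
# The list stands in for a Kafka topic; no real stream engine is involved.

log = [("u1", "login"), ("u2", "login"), ("u1", "logout"), ("u2", "login")]

def stream_job(events, from_offset=0):
    """Count events per user; rebuilding state is just a replay."""
    state = Counter()
    for user, _event in events[from_offset:]:
        state[user] += 1
    return state

live = stream_job(log)                    # the always-on job
rebuilt = stream_job(log, from_offset=0)  # a "backfill" replays from offset 0
assert live == rebuilt
```

Because the replayed result is identical to the live result by construction, Kappa removes the batch/stream inconsistency that Lambda must reconcile by hand.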
Data Dictionary – Serves as the core metadata service (Hive Metastore) that supplies schema information to streaming platforms, enabling zero‑code configuration for feature extraction, model training, and online inference.
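The "zero‑code" effect comes from letting schema metadata, rather than hand‑written parsing code, drive feature extraction. The sketch below uses a plain dict as a stand‑in for the Hive Metastore; the table name, column names, and type set are assumptions for illustration.

```python
# Hypothetical data-dictionary sketch: a registry (standing in for the Hive
# Metastore) supplies column schemas, so a feature job needs only a table
# name instead of per-table parsing code. Names and types are illustrative.

SCHEMAS = {
    "dwd_security_login": [("user_id", "string"), ("risk_score", "double")],
}

CASTS = {"string": str, "double": float}

def extract_features(table: str, row: list) -> dict:
    """Zip a raw row with the registered schema to get named, typed features."""
    schema = SCHEMAS[table]
    return {name: CASTS[typ](value) for (name, typ), value in zip(schema, row)}

print(extract_features("dwd_security_login", ["u1", "0.87"]))
```

Adding a new feature source then means registering a schema, not writing a new extractor.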
Future Outlook – The team is exploring data‑lake‑based stream‑batch integration to replace the current Hive + Kafka pattern, and addressing emerging security challenges such as unstructured image/text attacks, requiring new data‑structuring and linkage solutions.
58 Tech – Official tech channel of 58, a platform for tech innovation, sharing, and communication.