
Sohu Intelligent Media Data Warehouse Architecture and Technical Practices

This article presents Sohu Intelligent Media's data warehouse construction practice, covering fundamental concepts, batch and real‑time processing, OLAP theory, multidimensional modeling, workflow management, data quality, metadata lineage, and security, with a focus on Apache Doris and a Lambda‑style architecture.


Guest Speaker: Zhai Dongbo (Sohu Media)

Editor: Wang Hongda

Platform: DataFunTalk, AI Enlightener

Overview:

This talk focuses on Sohu Intelligent Media's data warehouse construction practice, briefly reviewing basic data warehouse concepts and outlining the two main processing streams: batch (non‑real‑time) and real‑time data processing.

Batch Data Processing:

Different business scenarios require data layering, where upper‑level data is derived from lower‑level data through aggregation, joins, etc. Task scheduling is a core function, leading to metadata management, data quality, and data permission management. The talk introduces Sohu's self‑developed systems for task scheduling, metadata, data quality, and data permission management.

Real‑time Data Processing:

While the industry focuses on stream processing, many aggregation and join use cases are better served by systems that support real‑time ingestion and MPP query engines, such as Apache Doris. The talk shares Sohu's technical practice with Apache Doris.

Data Warehouse Definition:

Introduced by Bill Inmon in 1991, a data warehouse is a subject‑oriented, integrated, relatively stable collection of historical data supporting management decisions, encompassing the entire data analysis process.

Data Analysis Types:

Data Reporting: few dimensions, low latency, high concurrency.

OLAP: moderate dimensions, latency, and concurrency.

Data Mining: hundreds of dimensions, high latency tolerance, low concurrency.

OLAP:

The most frequently used analytical technique in data warehouses, defined by Edgar F. Codd in 1993. OLTP supports transaction processing, while OLAP supports decision‑making and multidimensional analysis.

Multidimensional Model:

Consists of cubes, dimensions, hierarchies, levels, members, and measures. Operations include roll‑up, drill‑down, slice, dice, and pivot.

Multidimensional Operations:

Roll‑up: remove a dimension from the GROUP BY to aggregate to a coarser level.

Drill‑down: add a lower‑level dimension to the GROUP BY for finer detail.

Slice: fix one dimension with a single WHERE condition.

Dice: restrict two or more dimensions with WHERE conditions.

Pivot: reorder the columns in the SELECT list to rotate the view.
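The operations above map directly onto SQL. A minimal sketch using an in‑memory SQLite database; the sales table, its columns, and the data are illustrative assumptions, not from the talk:

```python
import sqlite3

# Toy cube: two dimensions (region > city, month) and one measure (amount).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, city TEXT, month TEXT, amount INT);
INSERT INTO sales VALUES
  ('North', 'Beijing',  '2023-01', 100),
  ('North', 'Tianjin',  '2023-01',  50),
  ('South', 'Shenzhen', '2023-02', 200);
""")

# Roll-up: drop the city level by grouping on region only.
rollup = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

# Drill-down: add the lower-level city dimension back into the GROUP BY.
drilldown = conn.execute(
    "SELECT region, city, SUM(amount) FROM sales "
    "GROUP BY region, city").fetchall()

# Slice: fix one dimension with a single WHERE condition.
slice_ = conn.execute(
    "SELECT city, amount FROM sales WHERE month = '2023-01'").fetchall()

# Dice: restrict two dimensions with two WHERE conditions.
dice = conn.execute(
    "SELECT city, amount FROM sales "
    "WHERE month = '2023-01' AND region = 'North'").fetchall()
```

Pivot is a presentation‑layer change (reordering the SELECT list), so it needs no new aggregation.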

OLAP Cube Construction:

Includes dimension construction (expansion, pruning) and measure construction (union of measures across cubes).

OLAP Storage Types:

ROLAP: Relational model‑based.

MOLAP: Multidimensional model stored in KV databases.

HOLAP: Hybrid of ROLAP and MOLAP.

Comparison: MOLAP offers faster query speed but slower load speed and larger storage; ROLAP has improved load speed, flexibility, and comparable query performance with modern optimizations.

Dimensional Modeling:

Based on Ralph Kimball's methodology, includes dimension tables, fact tables, and star/snowflake/galaxy schemas.

Table Layering:

Data warehouse layers include STG (raw), ODS (operational), DWD (detail), DWS (summary), ADS (application), and DIM (dimension), providing benefits such as avoiding siloed development, simplifying complexity, and clarifying hierarchy.
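The "upper layer derived from lower layer" idea can be sketched as a DWS summary table built from a DWD detail table. Table and column names here are illustrative assumptions:

```python
import sqlite3

# Sketch: derive a DWS (summary) table from a DWD (detail) table by aggregation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dwd_article_click (article_id INT, user_id INT, dt TEXT);
INSERT INTO dwd_article_click VALUES
  (1, 10, '2023-01-01'),
  (1, 11, '2023-01-01'),
  (2, 10, '2023-01-01');

-- DWS layer: daily page views and distinct users per article.
CREATE TABLE dws_article_daily AS
SELECT dt, article_id, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv
FROM dwd_article_click
GROUP BY dt, article_id;
""")

rows = conn.execute(
    "SELECT article_id, pv, uv FROM dws_article_daily "
    "ORDER BY article_id").fetchall()
```

Each layer only reads from the layer(s) beneath it, which is what keeps the hierarchy clear and development non‑siloed.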

Data Warehouse Architecture:

Inspired by the Lambda architecture, separating real‑time and batch layers. Batch data typically has hour‑level latency and serves as the final source; real‑time data has second‑to‑minute latency for reference.

Technical Architecture:

Batch computation: Interactive analysis (Impala, Apache Doris) and batch processing (Hive, Spark).

Real‑time computation: Stream processing (Spark Streaming, Flink) and statistical analysis (Apache Doris).

Apache Doris:

An MPP analytical database originally developed at Baidu, now a top‑level Apache project. Compared with alternatives: Kylin (MOLAP, not chosen), ClickHouse/Druid/Elasticsearch (two‑stage scatter/gather query execution, limited SQL support), Impala/Presto (MPP query engines over HDFS storage, limited real‑time support), and Impala+Kudu (deployment complexity, limited aggregation support).

Batch Data Management:

Mirrors industry solutions with task management, metadata management, data quality, and security. Tasks are defined as DAGs, with automatic scheduling based on dependencies.

Workflow Management System:

Defines fully automated processes (workflows) that pass documents, information, or tasks between executors. The system builds DAGs where nodes represent tasks and edges represent data dependencies.
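The "nodes are tasks, edges are data dependencies" model boils down to topological scheduling. A minimal sketch using Kahn's algorithm; the task names and the `run` callback are hypothetical:

```python
from collections import deque

def run_dag(deps, run):
    """Run each task once all of its parents have run (Kahn's topological sort).

    deps maps task name -> list of parent task names; run(task) executes one task.
    """
    children = {t: [] for t in deps}
    indegree = {t: len(parents) for t, parents in deps.items()}
    for task, parents in deps.items():
        for p in parents:
            children[p].append(task)

    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        run(task)
        order.append(task)
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:     # all parents done: child becomes runnable
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return order

order = run_dag(
    {"ods_load": [], "dwd_clean": ["ods_load"],
     "dws_agg": ["dwd_clean"], "ads_report": ["dws_agg"]},
    run=lambda task: None)  # stand-in executor
```

A real scheduler adds time triggers, retries, and alerting on top of this ordering core.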

Open‑Source Workflow Tools:

Azkaban, Oozie, Airflow – each has drawbacks such as poor multi‑user collaboration, complex dependency handling, and difficulty with back‑fill.

DAG Nodes → Tasks & Instances:

Task: edited by users, contains SQL/Shell code, hourly granularity, configurable dependencies, alerts.

Instance: generated per day/hour according to the task's schedule, inherits the task's code, and runs once all parent instances have succeeded and its scheduled time has arrived.
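Expanding a task definition into concrete instances can be sketched as follows; the instance‑ID format and function signature are assumptions, not the system's real schema:

```python
from datetime import datetime, timedelta

def generate_instances(task_name, day, hourly=True):
    """Expand one task into its (instance_id, scheduled_time) pairs for one day."""
    start = datetime.strptime(day, "%Y-%m-%d")
    if not hourly:
        # Daily granularity: a single instance at the start of the day.
        return [(f"{task_name}:{day}", start)]
    # Hourly granularity: 24 instances, one per hour.
    return [(f"{task_name}:{day}:{h:02d}", start + timedelta(hours=h))
            for h in range(24)]

hourly = generate_instances("dwd_clean", "2023-01-01")
daily = generate_instances("dws_agg", "2023-01-01", hourly=False)
```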

Instance Dependency Generation Rules:

The diagram illustrates how instance dependencies are generated.

Instance Dependency Example:

Hourly parent tasks aggregate into daily tables; the child task runs at 00:09 daily with self‑dependency and near‑dependency settings to meet business requirements.

Historical Data Backfill:

When a DAG needs new or corrected tasks, the root node can be re‑executed for a specific time range to repair downstream data.
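The backfill flow can be sketched as re‑triggering the root node once per day in the repair window, letting dependency propagation rerun everything downstream. The `run_for_date` hook is hypothetical:

```python
from datetime import date, timedelta

def backfill(root_task, start, end, run_for_date):
    """Re-run the DAG root for each day in [start, end] to repair downstream data."""
    repaired = []
    d = start
    while d <= end:
        # Downstream instances re-trigger automatically via their dependencies
        # once the root instance for this date succeeds.
        run_for_date(root_task, d)
        repaired.append(d.isoformat())
        d += timedelta(days=1)
    return repaired

days = backfill("ods_load", date(2023, 1, 1), date(2023, 1, 3),
                run_for_date=lambda task, d: None)  # stand-in executor
```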

Data Quality Management:

Validation unit: each task instance may produce multiple tables.

Rules: configured per table, multiple rules per table, checking row counts, key metrics, etc.

Trigger: executed after task instance; severe issues can block downstream scheduling.
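The rule model above, per‑table rules evaluated after an instance finishes, with severe failures blocking downstream tasks, can be sketched like this; the metric names and thresholds are illustrative assumptions:

```python
def check_table(stats, rules):
    """Evaluate quality rules against a table's post-run statistics.

    stats: metric name -> observed value.
    rules: list of (metric, predicate, blocking) triples.
    Returns (failed metric names, whether downstream scheduling should block).
    """
    failures, block = [], False
    for metric, predicate, blocking in rules:
        if not predicate(stats.get(metric)):
            failures.append(metric)
            block = block or blocking   # any failed blocking rule halts downstream
    return failures, block

stats = {"row_count": 0, "null_ratio": 0.01}
rules = [
    ("row_count",  lambda v: v is not None and v > 0,     True),   # severe
    ("null_ratio", lambda v: v is not None and v < 0.05,  False),  # warn only
]
failures, block = check_table(stats, rules)
```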

Metadata Management:

Functions include table creation, modification, lifecycle, size/partition statistics, search, and lineage tracking. Lineage parsing is performed by extending Hive's SemanticAnalyzer to capture table and column lineage at task save time.

HQL → Parser → Semantic Analyzer → Logical Plan Generator → Logical Optimizer → Physical Plan Generator → Physical Optimizer → Execution
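As a toy illustration of table‑level lineage extraction, a regex over simple INSERT…SELECT statements is sketched below. This is only a sketch: the production approach described above extends Hive's SemanticAnalyzer, which handles the full HQL grammar, subqueries, and column‑level lineage that a regex cannot:

```python
import re

def table_lineage(sql):
    """Return (target tables, source tables) for a simple INSERT...SELECT."""
    targets = set(re.findall(
        r"INSERT\s+(?:OVERWRITE\s+TABLE|INTO)\s+(\w+)", sql, re.I))
    sources = set(re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.I))
    return targets, sources

sql = ("INSERT OVERWRITE TABLE dws_article_daily "
       "SELECT dt, article_id, COUNT(*) "
       "FROM dwd_article_click GROUP BY dt, article_id")
targets, sources = table_lineage(sql)
```

Running the parse at task save time, as the system does, means the lineage graph stays current without scanning execution logs.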

Data Security Management:

Table‑level permissions are managed via request‑approval workflow; tasks are blocked if the user lacks required table permissions.

Real‑time Data Management:

Kafka topics are treated as tables; data can be ingested into Apache Doris via Spark Streaming SQL or Doris Routine Load, which parses JSON and maps fields to the target table.
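The core of that ingestion step, parsing each JSON message and mapping its fields onto the target table's columns, can be sketched as follows; the message fields and column names are illustrative assumptions, not Sohu's actual schema:

```python
import json

def map_message(message, columns):
    """Parse one Kafka JSON message and project it onto the target table's columns.

    Missing fields map to None, mirroring a NULL column value on load.
    """
    record = json.loads(message)
    return tuple(record.get(c) for c in columns)

row = map_message(
    '{"article_id": 1, "uid": 10, "ts": "2023-01-01 00:00:00", "extra": "x"}',
    columns=["article_id", "uid", "ts"])  # "extra" is dropped: not a target column
```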

Conclusion:

The project emphasizes productization and service‑orientation, leveraging reliable open‑source components and proven business solutions to accelerate implementation.

Thank you for listening.
