
Data Governance and Warehouse Evolution at NetEase Media: Architecture, Cost Management, and Future Outlook

The article details NetEase Media's data governance framework, covering business overview, data‑warehouse architecture evolution, layered data services, metadata and asset management, cost‑control strategies, and a roadmap for automated, mature data governance driven by DAMA principles.

DataFunTalk

01 Business Introduction

NetEase Media delivers content through its portal and news clients, aiming to decentralize information access for users. The data team supports daily operational reports, the A/B testing platform, channel analysis, personalized dashboards, and data collection for client monetization.

02 Data Architecture

The architecture consists of four layers: data ingestion, data computation, data service, and data application.

Data Ingestion Layer: Unified ingestion of structured and semi‑structured data from business databases, corporate data, client logs, and server logs into the warehouse.

Data Computation Layer: A Lambda architecture separates offline processing (Spark on Hive) from real‑time processing (Flink). Within the warehouse, data is organized into four layers: ODS, DWD, DWS, and APP.

Data Service Layer: Provides storage for data tools and standardized data services.

Data Application Layer: Supports internal and external data applications.
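
As a rough illustration of the ODS, DWD, DWS, and APP layering in the computation layer, the sketch below pushes a few in‑memory records through the four layers. The record schema, the cleaning rule, and the metrics are assumptions made for the example; the talk does not describe the actual tables or jobs, which run on Spark and Flink rather than plain Python.

```python
from collections import defaultdict

# Hypothetical raw client-log rows as they might land in ODS (assumed schema).
ods_rows = [
    {"uid": "u1", "event": "view", "channel": "news", "ts": "2021-06-01 08:00:00"},
    {"uid": "u1", "event": "view", "channel": "news", "ts": "2021-06-01 08:05:00"},
    {"uid": "u2", "event": "view", "channel": "video", "ts": "2021-06-01 09:00:00"},
    {"uid": None, "event": "view", "channel": "video", "ts": "2021-06-01 09:30:00"},
]


def build_dwd(rows):
    """DWD: cleaned detail rows. Drops rows without a user id and adds a date column."""
    return [dict(r, dt=r["ts"][:10]) for r in rows if r["uid"]]


def build_dws(dwd_rows):
    """DWS: lightly aggregated view counts per (date, channel, user)."""
    counts = defaultdict(int)
    for r in dwd_rows:
        counts[(r["dt"], r["channel"], r["uid"])] += 1
    return [
        {"dt": dt, "channel": ch, "uid": uid, "views": n}
        for (dt, ch, uid), n in sorted(counts.items())
    ]


def build_app(dws_rows):
    """APP: report-ready metrics per (date, channel): total views and distinct users."""
    agg = defaultdict(lambda: {"views": 0, "users": set()})
    for r in dws_rows:
        key = (r["dt"], r["channel"])
        agg[key]["views"] += r["views"]
        agg[key]["users"].add(r["uid"])
    return {k: {"views": v["views"], "uv": len(v["users"])} for k, v in agg.items()}


app = build_app(build_dws(build_dwd(ods_rows)))
print(app)
```

Each function reads only from the layer directly beneath it, which is the property that lets a layer be rebuilt or remodeled without touching the ones above.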

02 Data Warehouse Evolution

1. From 1.0 to 2.0

Before 2015, the business was portal‑centric with limited data volume and simple reporting needs. As the business expanded to multimedia content (video, live), data demands grew, leading to the creation of a dedicated data team and a dimensional‑model‑based warehouse (2.0) that delivered themed data domains, data products, and fine‑grained operational support.

2. From 2.0 to 3.0

In 3.0, wide‑table construction for analytical topics was introduced, enabling self‑service data extraction. The warehouse hierarchy was simplified from six to four logical layers, and ODS views were added to decouple data.
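
The talk does not spell out how the ODS views decouple data. One common reading, assumed here, is that downstream layers read from a view with stable column names, so the physical source table can be renamed or re‑ingested without breaking consumers. The sketch below only generates the Hive DDL text; the view name, table name, and columns are hypothetical.

```python
def ods_view_ddl(view_name, source_table, column_map):
    """Return a CREATE VIEW statement exposing stable column names over a source table.

    column_map maps the stable, downstream-facing name to the physical source column.
    """
    select_list = ",\n  ".join(
        f"`{src}` AS `{dst}`" for dst, src in column_map.items()
    )
    return (
        f"CREATE VIEW IF NOT EXISTS {view_name} AS\n"
        f"SELECT\n  {select_list}\nFROM {source_table}"
    )


ddl = ods_view_ddl(
    "ods.v_client_log",        # hypothetical view name read by DWD jobs
    "ods.client_log_raw_v2",   # hypothetical physical table behind the view
    {"user_id": "uid", "event_name": "evt", "event_time": "ts"},
)
print(ddl)
```

If the source table changes, only the view definition needs to be regenerated; queries in the DWD layer keep referring to `ods.v_client_log`.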

03 Data Governance System

1. Background

Rapid business growth created high, uncontrolled resource loads, data quality issues, and low development efficiency.

Cost: Unrestricted offline tasks caused resource contention.

Quality: High load led to unstable SLAs and missing quality controls.

Efficiency: Resource limits slowed delivery cycles.

2. Data Management Framework

Following the DAMA guide, the team built ten modules, with the focus on four of them: data modeling, metadata management, asset management, and cost management.

① Data Modeling & Design

Two‑phase data flow: (1) data‑driven operations for rapid data access, (2) data‑maintenance operations (collection, layering, thematic modeling) to improve quality and usability.

② Metadata Management

Four metadata categories (business, technical, process, and security) were defined, along with a metadata map and data‑lineage tracking that combines table‑level lineage from Hive SQL‑parsing plugins with task‑level lineage from the offline development tools.
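
A minimal sketch of how table‑level and task‑level lineage might be merged into one graph and queried for downstream impact. The edge format, the task metadata layout, and all table and task names are assumptions for illustration; the talk does not describe the real plugin output.

```python
from collections import defaultdict, deque

# Table-level edges (upstream, downstream), as a SQL-parsing plugin might report them.
table_edges = [
    ("ods.client_log", "dwd.page_view"),
    ("dwd.page_view", "dws.channel_daily"),
]

# Task-level metadata: each scheduled task declares its input and output tables.
tasks = {
    "export_report": {
        "inputs": ["dws.channel_daily"],
        "outputs": ["app.channel_report"],
    },
}


def merge_lineage(table_edges, tasks):
    """Combine both lineage sources into one adjacency map: table -> downstream tables."""
    graph = defaultdict(set)
    for up, down in table_edges:
        graph[up].add(down)
    for spec in tasks.values():
        for up in spec["inputs"]:
            for down in spec["outputs"]:
                graph[up].add(down)
    return graph


def downstream(graph, table):
    """Breadth-first search for every table reachable from the given table."""
    seen, queue = set(), deque([table])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


graph = merge_lineage(table_edges, tasks)
print(sorted(downstream(graph, "ods.client_log")))
```

In this example the edge into `app.channel_report` exists only in the task metadata, so the SQL‑derived edges alone would miss it. That is the kind of gap that combining the two sources is meant to close.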

③ Data Asset Management

Assets are classified into four levels (L4‑L1) based on impact, with L4 covering global assets like management dashboards, and L1 covering low‑impact or unknown‑impact assets. Asset levels guide protection measures and audit processes.
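
One way such levels could be assigned in practice is to rate the end products by hand and let each upstream table inherit the highest level found among its downstream assets. That propagation rule is an assumption, not something the talk states, and the sketch assumes the lineage graph is acyclic. All asset names are hypothetical.

```python
DEFAULT_LEVEL = 1  # L1: low-impact or unknown-impact assets


def propagate_levels(edges, declared):
    """Assign each asset the highest level among itself and everything downstream.

    edges: (upstream, downstream) pairs; assumed to form an acyclic graph.
    declared: hand-assigned levels (1 to 4) for assets whose impact is known.
    """
    children = {}
    nodes = set(declared)
    for up, down in edges:
        children.setdefault(up, []).append(down)
        nodes.update((up, down))

    levels = {}

    def level_of(node):
        if node not in levels:
            best = declared.get(node, DEFAULT_LEVEL)
            for child in children.get(node, []):
                best = max(best, level_of(child))
            levels[node] = best
        return levels[node]

    return {node: level_of(node) for node in nodes}


edges = [
    ("ods.a", "dwd.a"),
    ("dwd.a", "app.mgmt_dashboard"),
    ("ods.b", "dwd.b"),
]
declared = {"app.mgmt_dashboard": 4}  # L4: global asset such as a management dashboard

levels = propagate_levels(edges, declared)
print(levels)
```

Under this rule, `ods.a` and `dwd.a` are treated as L4 because a management dashboard depends on them, while `ods.b` and `dwd.b` stay at the default L1.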

④ Data Cost Management

Cost governance spans storage, compute, and operational cost.

Storage Cost Governance: Monitoring, zombie file cleanup, lifecycle policies, compression, and model optimization reduced physical storage by ~25%.

Compute Cost Governance: Monitoring CPU/Memory usage, eliminating zombie tasks, and migrating long‑running jobs lowered CPU usage by ~25%.

Resource Cost Operations: Pre‑, in‑, and post‑stage controls (guidelines, task reviews, automated optimization) ensure stable, efficient resource usage.
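
The storage and compute checks above could be sketched roughly as follows. The talk does not define "zombie" precisely, so the definitions here are assumptions: a zombie table is one with no reads for a configurable number of days, and a zombie task is one whose outputs were not read at all in the observation window. The thresholds, field names, and sample data are all hypothetical.

```python
from datetime import date, timedelta


def zombie_tables(last_access, today, idle_days=90):
    """Tables whose last read is at least idle_days old (threshold is an assumed value)."""
    cutoff = today - timedelta(days=idle_days)
    return sorted(t for t, d in last_access.items() if d <= cutoff)


def expired_partitions(partitions, today, retention_days):
    """Daily partitions (as date objects) that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partitions if p < cutoff]


def zombie_tasks(tasks, read_counts):
    """Tasks with no reads on any output table, sorted by CPU-hours, largest first.

    Tables missing from read_counts are treated as having zero reads.
    """
    idle = [
        (name, spec["cpu_hours"])
        for name, spec in tasks.items()
        if all(read_counts.get(t, 0) == 0 for t in spec["outputs"])
    ]
    return sorted(idle, key=lambda item: item[1], reverse=True)


today = date(2021, 12, 31)
last_access = {"dwd.old": date(2021, 6, 1), "dws.hot": date(2021, 12, 30)}
partitions = [date(2021, 12, 1), date(2021, 12, 25), date(2021, 12, 30)]
tasks = {
    "build_old_report": {"outputs": ["app.old_report"], "cpu_hours": 12.0},
    "build_channel_daily": {"outputs": ["dws.channel_daily"], "cpu_hours": 30.0},
    "build_tmp": {"outputs": ["tmp.scratch"], "cpu_hours": 40.0},
}
read_counts = {"dws.channel_daily": 57, "app.old_report": 0}

print(zombie_tables(last_access, today))
print(expired_partitions(partitions, today, retention_days=7))
print(zombie_tasks(tasks, read_counts))
```

Sorting zombie tasks by CPU‑hours puts the most expensive candidates first, so a manual review can start where the potential saving is largest. Candidates would still need that review before removal, since read statistics may not capture every consumer.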

04 Data Governance Outlook

The DAMA maturity assessment identifies four stages: Initial (ad‑hoc tools, fragmented roles), Repeatable (standardized tools, manual processes), Managed (institutionalized processes, risk focus), and Optimized (automated, predictive governance driven by robust metadata).

Conclusion

NetEase Media’s 2021 data‑governance initiative addressed high, uncontrolled resource loads by establishing an asset‑level hierarchy and cost‑control operations, achieving stable, long‑term data production.

Future work will leverage the enriched metadata system to automate and standardize governance activities.

Q&A

Q: Should Kafka monitoring cover the whole cluster or just data read/write?

A: We monitor Kafka, MySQL, and Oracle streams for consistency between source and warehouse, applying asset‑level safeguards to core data.

Q: How do you quantify data governance?

A: By linking governance outcomes to business value and ensuring high metadata coverage and accuracy.

Q: What is the scale of media data governance?

A: Roughly 4,000 metadata tables and more than 1,200 reports, guided by DAMA principles.

Q: How is lineage implemented?

A: Table lineage comes from Hive SQL‑parsing plugins, and task lineage comes from the offline development tools. The two are combined to produce accurate lineage.

Thank you for listening.
