How NetEase Cloud Music Cut Storage Costs by 30% Through Data Governance
This article details NetEase Cloud Music's year‑long data governance initiative: the data background, governance strategy, project plan, practical actions, results, and future outlook. It shows how metadata‑driven management reduced storage by over 30% while improving reliability and efficiency.
Data Background
NetEase Cloud Music operates nine independent products (six domestic, three overseas) and faces massive data scale: over 20,000 online scheduling tasks, more than 50,000 tables, and 12 data projects serving over 600 users, including algorithm engineers, analysts, data product teams, and business services. Daily storage costs exceed 190,000 CNY and daily compute costs exceed 270,000 CNY.
Quality issues include unstable core tasks and reports, coarse‑grained queue resource usage, and high operational costs. Efficiency suffers because many tasks still run on Hive and Spark 2, generating small files and consuming excessive resources, especially when downstream jobs directly read ODS tables.
The environment spans five domestic clusters and overseas clusters on Alibaba Cloud and AWS.
Governance Approach
Problems were identified from a technical perspective across four layers: HDFS files, database tables, model design, and task scheduling/execution engines.
Key issues:
HDFS files lacking management, leading to “orphaned” files that waste resources.
Unrestricted database creation resulting in many unused or poorly named tables.
Model design with low CDM layer reuse, many idle tasks, and heavy reliance on ODS data.
Tasks still running on Hive and Spark 2 with resource‑intensive small‑file problems.
Solutions focused on metadata: collecting HDFS, Hive, and task metadata to enable analysis and monitoring.
Project Plan
The plan began with acquiring complete metadata from NetEase’s data‑fabric team, covering table‑level, task‑level, and HDFS‑level information.
Metadata modeling produced wide CDM tables and dimension tables, enabling multi‑dimensional views of platform data, usage by teams, domains, individuals, and tables.
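As a minimal sketch of what such a wide table makes possible, the snippet below joins table‑level, task‑level, and HDFS‑level metadata into one row per table and then aggregates storage by any dimension. All field names, table names, and figures are hypothetical illustrations, not the team's actual schema.

```python
from collections import defaultdict

# Hypothetical metadata samples; every name and number here is illustrative.
tables = [
    {"table": "dw.user_play_d", "owner": "alice", "team": "algo", "hdfs_path": "/warehouse/dw/user_play_d"},
    {"table": "dw.song_dim", "owner": "bob", "team": "algo", "hdfs_path": "/warehouse/dw/song_dim"},
    {"table": "rpt.daily_kpi", "owner": "carol", "team": "analytics", "hdfs_path": "/warehouse/rpt/daily_kpi"},
]
hdfs_usage_tb = {
    "/warehouse/dw/user_play_d": 10.0,
    "/warehouse/dw/song_dim": 2.5,
    "/warehouse/rpt/daily_kpi": 3.0,
}
tasks = [
    {"task_id": 1, "output_table": "dw.user_play_d"},
    {"task_id": 2, "output_table": "rpt.daily_kpi"},
    {"task_id": 3, "output_table": "rpt.daily_kpi"},
]


def build_wide_rows(tables, hdfs_usage_tb, tasks):
    """Join table, HDFS, and task metadata into one wide row per table."""
    producing_tasks = defaultdict(int)
    for task in tasks:
        producing_tasks[task["output_table"]] += 1
    return [
        {
            **t,
            "size_tb": hdfs_usage_tb.get(t["hdfs_path"], 0.0),
            "producing_tasks": producing_tasks[t["table"]],
        }
        for t in tables
    ]


def storage_by(rows, dimension):
    """Sum storage (TB) over one dimension such as 'team' or 'owner'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["size_tb"]
    return dict(totals)


wide_rows = build_wide_rows(tables, hdfs_usage_tb, tasks)
team_storage = storage_by(wide_rows, "team")
```

The same `storage_by` call works for `"owner"` or any other column on the wide row, which is the point of modeling the metadata once and slicing it many ways.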
The governance framework follows five principles: evidence‑based governance, clear ownership, sustainable mechanisms, measurable outcomes, and reusable methods.
Four action categories support governance: monitoring, standards, tooling, and actual governance execution.
Governance Practices
Ownership
All data, tasks, and tables are assigned owners. ODS dump tasks are configured centrally; responsibilities are linked to developers. Issues with orphaned tables from departed staff or project accounts were addressed through ODS governance, ownership reassignment, and batch‑ownership tools.
Sustainable Mechanisms
A unified mechanism and set of principles for driving governance work forward were established to handle cross‑departmental collaboration and resource constraints.
HDFS Orphan File Governance
By correlating HDFS and Hive metadata with access logs, orphan files were identified and cleared, releasing over 7 PB of logical storage and removing more than 4.5 million files and directories.
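The correlation described above can be sketched roughly as follows: a directory is treated as an orphan candidate when it neither belongs to a Hive table location nor shows recent access. The function name, the input shapes, and the 90‑day idle threshold are assumptions made for illustration; the source does not state the actual rules.

```python
from datetime import date, timedelta


def find_orphan_dirs(hdfs_dirs, hive_locations, last_access, today, idle_days=90):
    """Return HDFS directories that no Hive table owns and nobody accessed recently.

    hdfs_dirs: iterable of directory paths taken from HDFS metadata
    hive_locations: iterable of table LOCATION paths taken from Hive metadata
    last_access: dict mapping path -> date of last access (from access logs)
    idle_days: assumed idle threshold; not a figure from the source
    """
    registered = [loc.rstrip("/") for loc in hive_locations]
    cutoff = today - timedelta(days=idle_days)
    orphans = []
    for raw in hdfs_dirs:
        path = raw.rstrip("/")
        # Owned if the path is a table location, sits under one, or contains one.
        owned = any(
            path == loc or path.startswith(loc + "/") or loc.startswith(path + "/")
            for loc in registered
        )
        accessed = last_access.get(path)
        recently_used = accessed is not None and accessed >= cutoff
        if not owned and not recently_used:
            orphans.append(path)
    return sorted(orphans)


orphans = find_orphan_dirs(
    hdfs_dirs=[
        "/warehouse/dw/t1/dt=2024-01-01",
        "/warehouse/dw",
        "/tmp/export_old",
        "/user/bob/scratch",
    ],
    hive_locations=["/warehouse/dw/t1"],
    last_access={
        "/tmp/export_old": date(2023, 1, 1),
        "/user/bob/scratch": date(2024, 5, 20),
    },
    today=date(2024, 6, 1),
)
```

In practice a candidate list like this would be reviewed with owners before anything is deleted; the sketch only covers identification.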
Database Governance
From over 70 databases, 27 unused ones were decommissioned, and usage standards were defined, consolidating active databases to 22.
Table Governance
Four major initiatives:
Temporary table cleanup, reducing both stock and growth.
Lifecycle management, improving coverage.
Large‑table (cost > 100 CNY/day) optimization, targeting 163 tables that accounted for 80% of storage.
AB‑test task optimization, migrating to a new system and retiring legacy tasks.
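For the large‑table initiative, one way to pick targets is to filter tables by daily cost and check how much of the total they cover. The sketch below uses the 100 CNY/day threshold from the text; the table names and costs are made up, and cost share is used here as a stand‑in for the storage share the article reports.

```python
def select_large_tables(daily_cost_cny, threshold=100.0):
    """Return (tables above the cost threshold, ranked by cost; their share of total cost)."""
    large = {name: cost for name, cost in daily_cost_cny.items() if cost > threshold}
    total = sum(daily_cost_cny.values())
    share = sum(large.values()) / total if total else 0.0
    ranked = sorted(large, key=large.get, reverse=True)
    return ranked, share


# Hypothetical per-table daily storage cost in CNY.
sample_costs = {
    "dw.play_log": 500.0,
    "dw.user_action": 300.0,
    "dw.song_dim": 100.0,  # exactly at the threshold, so not selected
    "rpt.daily_kpi": 60.0,
    "tmp.scratch": 40.0,
}
large_tables, cost_share = select_large_tables(sample_costs)
```

Ranking by cost keeps the optimization effort focused on the few tables where a change has the largest effect.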
Model Design – “Three‑Degree” Metrics
The team introduced metrics for progress, health, and value. Under these metrics, CDM table reuse improved from 30% to 60%, and the penetration rate (downstream jobs bypassing the CDM layer to read ODS data directly) fell from 20% to 10%.
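The source does not give formulas for these metrics, so the following is only one plausible reading. It assumes reuse means the share of CDM tables read by at least two distinct downstream tasks, and penetration means the share of tasks that read an ODS table directly. The function names, the `ods.` prefix convention, and the sample lineage are all hypothetical.

```python
def cdm_reuse_rate(consumers_by_cdm_table, min_consumers=2):
    """Share of CDM tables with at least `min_consumers` distinct downstream tasks (assumed definition)."""
    if not consumers_by_cdm_table:
        return 0.0
    reused = sum(
        1
        for consumers in consumers_by_cdm_table.values()
        if len(set(consumers)) >= min_consumers
    )
    return reused / len(consumers_by_cdm_table)


def ods_penetration_rate(inputs_by_task, ods_prefix="ods."):
    """Share of tasks that read at least one ODS table directly (assumed definition)."""
    if not inputs_by_task:
        return 0.0
    penetrating = sum(
        1
        for inputs in inputs_by_task.values()
        if any(table.startswith(ods_prefix) for table in inputs)
    )
    return penetrating / len(inputs_by_task)


# Hypothetical lineage samples.
consumers = {
    "cdm.user_play": ["t1", "t2"],
    "cdm.song_stats": ["t1"],
    "cdm.unused": [],
    "cdm.artist": ["t3", "t4", "t3"],  # duplicates count once
}
task_inputs = {
    "t1": ["ods.play_log", "cdm.user_play"],
    "t2": ["cdm.user_play"],
    "t3": ["cdm.artist"],
    "t4": ["cdm.artist"],
    "t5": ["ods.click_log"],
}
reuse = cdm_reuse_rate(consumers)
penetration = ods_penetration_rate(task_inputs)
```

Tracking both numbers together makes sense under this reading: as more tasks consume shared CDM tables, fewer of them need to reach into ODS.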
Compute Governance
Tasks were migrated from Hive and Spark 2 to Spark 3, with adaptive query execution (AQE), Z‑order, and ZSTD compression enabled, yielding significant resource savings across multiple migration projects.
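A migrated job might carry settings along these lines. This is a hedged sketch: the keys are standard Spark 3 SQL configuration options, but the values, the choice of Parquet as the file format, and the helper names are assumptions rather than the team's actual setup. Z‑order is left out because its syntax depends on the engine or extension in use, which the source does not specify.

```python
def spark3_governance_conf(advisory_partition_size="128MB"):
    """Build an illustrative set of Spark 3 options for AQE and ZSTD output."""
    return {
        # Adaptive query execution re-plans stages using runtime statistics.
        "spark.sql.adaptive.enabled": "true",
        # Merge small shuffle partitions, which helps limit small output files.
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Target size for coalesced partitions (assumed value).
        "spark.sql.adaptive.advisoryPartitionSizeInBytes": advisory_partition_size,
        # Write Parquet with ZSTD (assumes tables are stored as Parquet).
        "spark.sql.parquet.compression.codec": "zstd",
    }


def to_submit_args(conf):
    """Render a config dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(spark3_governance_conf())
```

Keeping the options in one function gives migration projects a single place to review and adjust defaults instead of repeating them per task.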
Project Outcomes
Cost & Benefit
Storage: 30% of total storage decommissioned; daily storage growth slowed from 170 TB to 55 TB.
Compute: Core and high‑cost tasks saved more than 30% of compute resources; cluster stability improved; core task delivery moved up from 9:00 am to 5:30 am.
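To put the figures above in relative terms, the short calculation below derives the percentage drop in daily storage growth and how much earlier core tasks are delivered. It uses only the numbers reported in this section; the helper name is illustrative.

```python
from datetime import datetime


def pct_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Daily storage growth fell from 170 TB to 55 TB.
growth_cut_pct = round(pct_reduction(170, 55), 1)

# Core task delivery moved from 9:00 am to 5:30 am (the date itself is arbitrary).
delta = datetime(2000, 1, 1, 9, 0) - datetime(2000, 1, 1, 5, 30)
hours_earlier = delta.total_seconds() / 3600
```

Under these figures, daily growth slowed by roughly two thirds and core data became available three and a half hours earlier.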
Governance Assets
The project produced visual dashboards (data‑asset sandbox, three‑degree overview, cost‑storage sandbox, governance effect board) and monitoring tools for orphan files, tasks, and more.
Standardization
Established comprehensive data development standards covering database usage, temporary table creation, node naming, queue usage, task release, and data‑governance decommission processes.
Future Outlook
Data governance will continue evolving from fragmented to centralized, from reactive to proactive, and from experience‑based to intelligent. The three‑stage management (pre‑, mid‑, post‑governance) emphasizes preventive actions, enriched monitoring, and automated, intelligent solutions.