How Cloud Music Scaled Data Governance: Practices, Metrics, and Lessons Learned

The article details Cloud Music’s data‑governance journey, covering early modeling standards, self‑service data tools, quality and metadata management, asset‑reuse improvements, and cost‑saving Spark optimizations, while sharing concrete metrics, processes, and the team’s systematic methodology.

Past Memory Big Data

Oct 9, 2022

1. Early Foundations

Cloud Music now processes billions of user log records every day, and its data footprint has reached 200 PB. That volume is an opportunity, but it also drives high hardware costs. To get a return on that data, the team first built a public‑layer data model, created the easyDesign modeling system together with the Hangzhou Research Institute, and produced a comprehensive business bus matrix covering entities such as users, items, and scenes.

Next, they tackled data‑link governance. In early 2020, they launched the EasyTracker platform to standardize the design, development, testing, and post‑mortem management of event tracking, and migrated thousands of existing tracking points to a uniform format.

To address the “last mile” of data access, the team delivered the self‑service tool EasyFetch. After several iterations, EasyFetch met its functional and usability requirements and supported more than 30 business‑line data models; adoption was backed by more than 30 training sessions and dedicated support groups. In 2020, more than 400 internal users ran over 150 000 self‑service queries, with a peak of 100 daily active users.

2. Data‑Governance Three Pillars

Following the 2021 IPO, the team identified three focus areas for 2022: quality governance, assetization, and cost‑efficiency.

2.1 Quality Governance

The team defined explicit quality standards, reinforced compliance, and built platform tools. Metadata was treated as a first‑class asset: a quarterly effort with the “YouShu” team produced complete, accurate metadata that met governance requirements. Operational SOPs were codified, including two core rules—“no small production issue is ignored” and “trace every problem back to its source.” These measures cut task breakage rates by 60 % and reduced average incident‑resolution time by 80 %.
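The article does not say how the two reported metrics are computed. As a minimal sketch, assume the breakage rate is the share of task runs that missed their delivery deadline and the resolution time is the mean interval between an incident being opened and resolved. The `Incident` record, the function names, and the sample numbers below are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    opened: datetime
    resolved: datetime


def breakage_rate(missed_runs: int, total_runs: int) -> float:
    """Share of task runs that missed their delivery deadline (assumed definition)."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return missed_runs / total_runs


def mean_resolution_hours(incidents: list[Incident]) -> float:
    """Average time from opening to resolution, in hours."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    return total.total_seconds() / 3600 / len(incidents)


# Made-up sample data, not figures from the article.
start = datetime(2022, 1, 1, 2, 0)
sample = [
    Incident(start, start + timedelta(hours=2)),
    Incident(start, start + timedelta(hours=4)),
]
rate = breakage_rate(missed_runs=5, total_runs=1000)
avg_hours = mean_resolution_hours(sample)
```

Tracking both numbers per period with a fixed definition is what makes a claim like "60 % fewer breakages" comparable over time.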

2.2 Assetization

Adopting a “three‑degree model” (construction progress, asset health, business value), the team quantified data‑warehouse maturity. After the team refactored models and deprecated 24 000 obsolete tables, data‑asset reuse rose from 30 % to 55 %, a relative increase of about 83 %. Storage growth slowed from 170 TB/day to 50 TB/day, partly due to lifecycle‑management policies.
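To put the reported figures on a common footing, the short calculation below expresses each one as a relative change. It uses only the numbers quoted in this section; the helper name `relative_change` is an illustrative choice.

```python
def relative_change(before: float, after: float) -> float:
    """Return (after - before) / before, e.g. 0.5 for a 50 % increase."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


# Data-asset reuse: 30 % -> 55 %
reuse_change = relative_change(30, 55)

# Daily storage growth: 170 TB/day -> 50 TB/day
growth_change = relative_change(170, 50)

print(f"reuse rate: {reuse_change:+.1%}")
print(f"storage growth: {growth_change:+.1%}")
```

Under these numbers the reuse rate grew by roughly 83 % in relative terms, and daily storage growth fell by roughly 71 %.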

2.3 Cost‑Efficiency

Detailed cost accounting showed that compute and storage were the major cost items. Spark upgrades delivered substantial savings: 266 Spark 3.1 jobs, which account for 95 % of resource usage, saw execution time fall by more than 60 % and cost fall by 60 %; 631 jobs migrated from Spark 2 to Spark 3 used 28.71 % fewer resources and improved performance by 52.07 %. A focused Z‑order + gzip effort on 170 jobs cut their storage by 68 %, equivalent to 55 TB/day and ¥7.983 M per year. Cluster stability remained high, supporting the early‑morning data delivery of the baseline‑530 project.
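The article does not describe how Z‑ordering was implemented in the team's Spark jobs. As a general illustration of the idea only, the sketch below computes a Morton (Z‑order) code by interleaving the bits of two integer column values. Sorting rows by that code tends to keep rows with similar values in both columns physically close together, which is what typically helps columnar compression and data skipping. The function name and the toy 4×4 grid are assumptions made for this example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Morton (Z-order) code of two non-negative integers.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    """
    if x < 0 or y < 0:
        raise ValueError("inputs must be non-negative")
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Toy data: every (x, y) pair on a 4x4 grid.
rows = [(x, y) for x in range(4) for y in range(4)]

# Sorting by the Morton code groups each 2x2 block of the grid together,
# instead of sweeping one column value at a time.
z_sorted = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))
```

A plain sort on `x` then `y` clusters well on `x` only; the Z‑order sort trades some locality on `x` for locality on both columns at once, which suits tables that are filtered on more than one column.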

3. Systematic Thinking

The team’s methodology draws on DAMA and industry best practices, applying standards across the data lifecycle (pre‑, in‑, and post‑process). Governance standards include SLA for quality stability, three‑degree asset evaluation, and resource‑level quantification. Organizationally, governance is embedded in production with clear role responsibilities rather than new committees, and cross‑team checks (e.g., QA overseeing data‑team SLA reports) ensure accountability.

Technically, tools are introduced only after a problem has been clearly defined; the platform team focuses on pragmatic solutions and avoids reinventing the wheel with large in‑house builds. Continuous tracking of emerging technologies and collaboration with research partners sustain long‑term innovation.

Written by

Past Memory Big Data

A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
