
Design and Implementation of an Online Configurable Data Consumption Service for NetEase Cloud Music Frontend Performance Monitoring (Corona)

This article details NetEase Cloud Music's end‑to‑end, online‑configurable data‑consumption service and schema‑driven visualization platform, which transform raw client logs into ClickHouse records, automatically generate tables and dashboards, and provide observability over the pipeline. The approach dramatically reduces manual effort while supporting more than twenty performance metrics for frontend monitoring.

NetEase Cloud Music Tech Team

In 2022, NetEase Cloud Music's public technology team and the large‑frontend team built a performance monitoring service from scratch, covering more than 20 monitoring scenarios and over 100 dashboards. After a year of production, this article reviews the platform‑side design and implementation, focusing on the data‑consumption‑to‑visualization pipeline.

The classic performance monitoring data flow consists of: client SDK log collection → log transport → data consumption & modeling → data storage → visualization analysis. This article concentrates on the middle stages (data consumption, modeling, storage, and visualization) and does not discuss the client SDK.

Technical Architecture Overview

The choice of time‑series database is critical because it connects data consumption to visualization. After encountering pain points with InfluxDB, the team selected ClickHouse, which has proven to meet the performance analysis requirements of Cloud Music.

The service needed to support five client platforms and more than 20 monitoring metrics with limited time and manpower.

Online Configurable Data Consumption Service

3.1 Main Work of the Data Consumption Service

Using the cold‑boot monitoring scenario as an example, a simplified client log looks like:

{
  "props": {
    "mspm": "NativeApplication",
    "category": "Perf",
    "type": "coldBoot",
    "coldBootDataType": "000",
    "coldBootData": [
      {
        "name": "LAUNCH",
        "during": 800,
        "module": [
          {"name": "initNetwork", "during": 21},
          {"name": "initNavigator", "during": 6},
          ...
        ]
      },
      {
        "name": "MAIN_PAGE",
        "during": 100,
        "module": [...]
      }
    ],
    "brand": "Apple",
    "model": "iphone13,4",
    "appname": "music"
  },
  "os": "iphone",
  "osver": "15.5",
  "appver": "9.0.25",
  "buildver": "4742",
  "logtime": 1711958766
}

Key fields such as props.type (e.g., coldBoot) identify the monitoring item, while props.coldBootData contains detailed stage and module timing. Common fields like os and osver appear in all logs.

Typical analysis requests include computing average, P50, and P90 for total launch time (LAUNCH + MAIN_PAGE), for individual stages, or for specific modules.
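These aggregations map directly onto ClickHouse's built-in `avg` and `quantile` functions. A minimal sketch of the kind of query the analysis requests imply, assuming the `cold_boot_multi_stage` table and column names from the fan-out example below (the real queries are generated by the platform):

```python
def launch_time_query(table: str = "cold_boot_multi_stage",
                      app_version: str = "9.0.25") -> str:
    """Build a ClickHouse query for average, P50, and P90 total launch time.

    quantile(level)(col) is ClickHouse's approximate-percentile aggregate.
    """
    return (
        f"SELECT avg(stageCost) AS avg_cost, "
        f"quantile(0.5)(stageCost) AS p50, "
        f"quantile(0.9)(stageCost) AS p90 "
        f"FROM {table} "
        f"WHERE stageName = 'LAUNCH,MAIN_PAGE' "
        f"AND appVersion = '{app_version}'"
    )
```

Per-stage and per-module variants differ only in the target table and the `WHERE` clause.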

To satisfy these requests, the service converts a single raw log into multiple database records, for example:

[
  {
    "table": "cold_boot_multi_stage", // multi‑stage summary
    "row": {
      "stageName": "LAUNCH,MAIN_PAGE",
      "stageCost": 900,
      "coldBootDataType": "000",
      "appName": "music",
      "appVersion": "9.0.25"
      // ... other fields omitted
    }
  },
  {
    "table": "cold_boot_stage", // single stage
    "row": {"stageName": "LAUNCH", "stageCost": 800}
  },
  {
    "table": "cold_boot_stage",
    "row": {"stageName": "MAIN_PAGE", "stageCost": 100}
  },
  {
    "table": "cold_boot_module", // single module
    "row": {"stageName": "MAIN_PAGE", "moduleName": "initNetwork", "moduleCost": 21}
  }
  // ... other module records omitted
]

The service then batch‑writes these records into ClickHouse.
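The fan-out above can be sketched as a pure transformation from one raw log to a list of table rows. This is an illustrative reconstruction, not the service's actual code; field and table names follow the examples above:

```python
def transform_cold_boot(log: dict) -> list[dict]:
    """Fan one raw cold-boot log out into multiple table rows (sketch)."""
    props = log["props"]
    stages = props["coldBootData"]
    common = {"appName": props.get("appname"), "appVersion": log.get("appver")}

    # Multi-stage summary row: total launch time across all stages.
    rows = [{
        "table": "cold_boot_multi_stage",
        "row": {**common,
                "stageName": ",".join(s["name"] for s in stages),
                "stageCost": sum(s["during"] for s in stages),
                "coldBootDataType": props.get("coldBootDataType")},
    }]

    # One row per stage, and one row per module within each stage.
    for stage in stages:
        rows.append({"table": "cold_boot_stage",
                     "row": {**common, "stageName": stage["name"],
                             "stageCost": stage["during"]}})
        for module in stage.get("module", []):
            rows.append({"table": "cold_boot_module",
                         "row": {**common, "stageName": stage["name"],
                                 "moduleName": module["name"],
                                 "moduleCost": module["during"]}})
    return rows
```

Keeping the transformation a pure function of the input log is what makes it feasible to author and test this logic online against sample logs.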

The main responsibilities of the data consumption service are:

Validate incoming logs and filter out abnormal data (e.g., excessively long cold‑boot times).

Transform raw logs into query‑friendly database rows.

Batch‑write the transformed rows into the database.
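The three responsibilities compose into a small pipeline. A minimal sketch, assuming a hypothetical validity threshold and using a generic `sink` callable in place of a real ClickHouse client's insert call:

```python
from typing import Callable

MAX_COLD_BOOT_MS = 60_000  # assumed threshold; real limits are configured online


def is_valid_cold_boot(log: dict) -> bool:
    """Filter out abnormal data, e.g. implausibly long cold-boot times."""
    stages = log.get("props", {}).get("coldBootData", [])
    total = sum(s.get("during", 0) for s in stages)
    return 0 < total <= MAX_COLD_BOOT_MS


class BatchWriter:
    """Buffer transformed rows and flush them to the database in batches."""

    def __init__(self, sink: Callable[[list], None], batch_size: int = 1000):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer: list = []

    def add(self, rows: list) -> None:
        self.buffer.extend(rows)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
```

Batching is what keeps write amplification manageable: ClickHouse strongly favors few large inserts over many small ones.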

3.2 Background of the Online Configurable Service

Previously, adding a new monitoring scenario required many manual steps: mock log generation, creating a new consumer service, writing validation and transformation code, defining database schemas, writing SQL for table creation, manually altering the database, and deploying the service. With 20+ metrics, this process became time‑consuming and error‑prone.

The upgraded approach reduces the workflow to a single step: online authoring of validation and transformation logic based on real sample logs, followed by one‑click schema generation and table creation.
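One-click schema generation amounts to inferring column types from a sample transformed row and emitting the corresponding DDL. A simplified sketch; the type mapping and the `MergeTree` engine choice here are illustrative, not the platform's actual DDL:

```python
def infer_clickhouse_schema(sample_row: dict) -> dict:
    """Map a sample row's Python value types onto ClickHouse column types."""
    mapping = {int: "Int64", float: "Float64", str: "String", bool: "UInt8"}
    return {col: mapping.get(type(val), "String")
            for col, val in sample_row.items()}


def create_table_sql(table: str, schema: dict) -> str:
    """Emit CREATE TABLE DDL for an inferred schema (simplified)."""
    cols = ", ".join(f"`{c}` {t}" for c, t in schema.items())
    return f"CREATE TABLE {table} ({cols}) ENGINE = MergeTree ORDER BY tuple()"
```

In the real workflow the inferred schema is shown in the UI for developers to confirm or adjust before the tables are created.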

3.3 Development Demonstration

The UI allows automatic detection of new log types, editing of consumer configurations, setting collection rules, writing conversion logic, and auto‑generating ClickHouse table schemas. After confirming the schema, a “Create Database” button creates the tables, and a “Push Config” button activates the new data source.

3.4 Observability of the Data Consumption Service

Key metrics such as consumption latency are visualized in Grafana dashboards. Latency spikes indicate insufficient pod resources, leading to Kafka backlog. Alerts are configured to notify developers when backlog occurs.
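The consumption-latency metric itself is simple: the wall-clock time at consumption minus the log's client-side timestamp. A minimal sketch, assuming the `logtime` field from the sample log is an epoch timestamp in seconds:

```python
import time


def consumption_lag_seconds(log: dict, now=None) -> float:
    """Consumption latency for one log: consume time minus client `logtime`.

    A sustained rise across logs suggests the consumers cannot keep up and
    a Kafka backlog is forming.
    """
    now = time.time() if now is None else now
    return now - log["logtime"]
```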

Data Visualization Service

4.1 Schema‑Based Report Building

Traditional development of a new monitoring page requires UI design, backend API development, component implementation, and page assembly—an effort that does not scale to dozens of metrics. The platform abstracts three measurement dashboards and six process‑analysis dashboards, and provides unified frontend and backend components for each. Developers only need to declare a Schema that describes the required dashboards and the underlying queries.

4.2 Dashboard Types

Measurement dashboards include:

Numeric (average, P50, P75, P90) – e.g., FPS, memory.

Lifecycle – e.g., cold‑boot total time and sub‑stage times.

Sample‑ratio – e.g., crash rate, error rate.

Process‑analysis dashboards include multi‑dimensional trend charts, performance bucket distribution, dimension ratio trends, dimension distribution ranking, normal‑distribution plots, and top‑list aggregations.

4.3 Page Development Process

Developers declare filter items, data sources, and required dashboards in a Schema, and the platform renders the page automatically. In practice, building a complex page for client‑side page‑launch monitoring took about one hour from data‑consumption modeling to visual report.
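A hypothetical shape for such a Schema, expressed as a plain dictionary; the field names here are illustrative, not the platform's actual configuration format:

```python
# Declarative page description: filters, data source, and dashboards.
# The platform's renderer would walk this structure and instantiate the
# matching dashboard components.
page_schema = {
    "title": "Cold Boot Monitoring",
    "filters": ["appVersion", "os", "coldBootDataType"],
    "dataSource": {"table": "cold_boot_stage", "timeField": "logtime"},
    "dashboards": [
        {"type": "numeric", "metric": "stageCost",
         "aggregations": ["avg", "p50", "p90"]},
        {"type": "lifecycle", "stageField": "stageName",
         "costField": "stageCost"},
    ],
}
```

Because the page is data, adding a dashboard is an edit to this declaration rather than a frontend release.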

4.4 Preliminary Intelligent Analysis

The platform also provides an assisted analysis feature that builds an analysis tree to break down performance degradation by dimensions, helping developers quickly locate problematic dimensions.
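The core comparison such an analysis tree repeats at each level can be sketched as ranking dimension values by how much a metric degraded versus a baseline period. An illustrative sketch, not the platform's actual algorithm:

```python
def rank_degradation(baseline: dict, current: dict) -> list:
    """Rank dimension values (e.g. app versions) by metric degradation.

    `baseline` and `current` map a dimension value to a metric such as
    P90 cold-boot time in two time windows; the biggest positive delta
    is the most suspicious branch to drill into next.
    """
    deltas = {k: current[k] - baseline.get(k, 0.0) for k in current}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
```

Applying this per dimension (OS version, app version, device model, ...) and recursing into the worst value yields the tree that points developers at the problematic dimension.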

4.5 Observability of the Visualization Service

Slow‑query dashboards monitor overall and per‑table query latency, ensuring the visualization service remains stable.

Conclusion

The article presented the end‑to‑end design and implementation of NetEase Cloud Music’s large‑frontend performance monitoring service, emphasizing the online configurable data‑consumption service and schema‑driven report building. The system now plays a crucial role in performance optimization, regression prevention, and release decision‑making. Future work will focus on deeper intelligent analysis to further lower the barrier for developers.

Tags: frontend, data pipeline, performance monitoring, ClickHouse, visualization, online configuration