NetEase Yanxuan Data Task Governance Practice: Pre‑, In‑, and Post‑Operation Strategies

NetEase Yanxuan tackled data‑task governance by establishing pre‑operation guarantees, baseline‑driven in‑operation controls, and post‑operation interventions, delivering stable task output, reduced alarms, lineage awareness, rapid incident recovery, and reusable best‑practice products that earned the 2020 Technology Sharing Co‑building Award.

NetEase Yanxuan Technology Product Team

Feb 5, 2021

In 2020 NetEase Yanxuan identified several urgent pain points in data‑task governance. Under the leadership of department heads and the Hangzhou Research Institute, a joint team was formed to co‑build solutions. Over a year of collaboration, the project not only solved concrete warehouse problems but also produced reusable products, notable design ideas, and valuable experience, earning the 2020 “Technology Sharing Co‑building Award”.

At the DataFunTalk year‑end conference (2020‑12‑20) the author presented this practice in a talk titled “NetEase Yanxuan Data Task Governance”. The talk covered the three‑stage approach (pre‑, in‑, and post‑operation) and the challenges behind it: timely and stable task output, alarm reduction paired with actionable interventions, link awareness (task lineage), damage control, and rapid incident recovery.

Background

The main improvement points were:

Model design and development standards

Task operation (timely, accurate, stable output; fast fault localisation; impact assessment)

Alarm optimisation (reduce night‑time alerts, provide intervention measures)

Link awareness (task lineage, impact propagation)

Testing environment and assistance

Rapid recovery for critical incidents

1. Pre‑operation – The First Line of Defence Before Model Release

Five guarantees were defined: process, model design, data quality, testing, and link awareness.

1.1 Process Guarantee – The development workflow was refined into three stages: requirement, development, and production. Requirements are captured via JIRA, reviewed, and recorded in the metric management system (Cangjie). Development includes model design documentation, design review, and adherence to conventions such as one‑task‑one‑model, consistent naming, and separation of sync tasks.

1.2 Model Design Guarantee – Emphasises “design before development”. Models are organised by layers (rdb → dwd → dws → dm). Design details (dimensions, measures, granularity) are stored in the Model Design Center. Cross‑layer dependency rates and model hotness are used to evaluate design quality.
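
The article does not define how the cross‑layer dependency rate is computed. As a rough sketch, assume it is the share of dependencies in which a model reads from a layer more than one step below its own (for example, dm reading directly from rdb). The `layer.table` naming and the function names are hypothetical.

```python
# Layer order taken from the article; everything else is an illustrative assumption.
LAYERS = ["rdb", "dwd", "dws", "dm"]


def layer_of(table: str) -> str:
    """Assume table names look like 'layer.table' (hypothetical convention)."""
    return table.split(".", 1)[0]


def cross_layer_rate(deps):
    """deps is a list of (downstream_table, upstream_table) pairs.

    A dependency counts as cross-layer when it skips at least one layer.
    """
    if not deps:
        return 0.0
    skipped = 0
    for downstream, upstream in deps:
        gap = LAYERS.index(layer_of(downstream)) - LAYERS.index(layer_of(upstream))
        if gap > 1:
            skipped += 1
    return skipped / len(deps)


deps = [
    ("dwd.order", "rdb.order"),
    ("dws.order_day", "dwd.order"),
    ("dm.sales", "rdb.order"),      # skips dwd and dws
    ("dm.sales", "dws.order_day"),
]
rate = cross_layer_rate(deps)
```

Under this assumed definition, a lower rate suggests that models are reusing the intermediate layers rather than bypassing them.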

1.3 Data Quality Guarantee – Implemented via the Data Quality Center, which enforces completeness, uniqueness, validity, consistency, and precision through configurable audit rules at table and field levels. Strong rules block task execution; weak rules generate alerts. The center also provides dashboards, rankings, and scoring.
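
The difference between strong and weak rules can be sketched as follows. This is a minimal illustration, not the Data Quality Center's actual interface: `AuditRule`, `StrongRuleViolation`, and `audit` are hypothetical names, and rows are modelled as plain dictionaries.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class AuditRule:
    name: str
    check: Callable[[List[Row]], bool]  # returns True when the data passes
    strong: bool                        # strong rules block, weak rules alert


class StrongRuleViolation(Exception):
    """Raised to stop the task when a strong rule fails."""


def audit(rows: List[Row], rules: List[AuditRule]) -> List[str]:
    """Run all rules; return names of failed weak rules, raise on a failed strong rule."""
    alerts = []
    for rule in rules:
        if rule.check(rows):
            continue
        if rule.strong:
            raise StrongRuleViolation(rule.name)
        alerts.append(rule.name)
    return alerts


# Uniqueness as a strong rule, completeness as a weak rule (illustrative choices).
unique_id = AuditRule(
    "order_id_unique",
    lambda rows: len({r["order_id"] for r in rows}) == len(rows),
    strong=True,
)
amount_present = AuditRule(
    "amount_not_null",
    lambda rows: all(r["amount"] is not None for r in rows),
    strong=False,
)

rows = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": None}]
alerts = audit(rows, [unique_id, amount_present])
```

A scheduler built along these lines would treat the exception as a reason not to publish the table, while the returned alert names would feed the alerting channel.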

1.4 Testing Guarantee – A dedicated testing environment mirrors production metadata, creates *_dev databases for core layers, and supports data comparison and shape inspection. This isolates development from production data and introduces automated test reporting.

1.5 Link‑Awareness Guarantee – Provides lineage visibility from source → rdb → dwd → dws → dm. A change to an upstream model triggers alerts and workflow controls for the downstream tasks that depend on it, so the damage from a bad change can be contained before it spreads.
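
Finding the downstream models that need an alert after an upstream change is a graph traversal. The sketch below assumes lineage is available as a mapping from each table to its direct consumers; the table names and the `downstream_of` helper are hypothetical.

```python
from collections import deque


def downstream_of(lineage, changed):
    """Breadth-first walk over lineage, returning every table affected by `changed`.

    lineage maps a table to the tables that read from it directly.
    """
    affected = []
    visited = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in visited:
                visited.add(child)
                affected.append(child)
                queue.append(child)
    return affected


lineage = {
    "rdb.order": ["dwd.order"],
    "dwd.order": ["dws.order_day", "dws.user_day"],
    "dws.order_day": ["dm.sales"],
    "dws.user_day": ["dm.sales"],
}
affected = downstream_of(lineage, "dwd.order")
```

The breadth-first order lists the nearest consumers first, which is usually the order in which owners would want to be notified.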

2. In‑operation – Baseline‑Based Task Operation

A “baseline” concept classifies daily tasks into time‑based baselines (02:30, 04:30, 07:30, 09:30, default). Each baseline groups tasks by importance, application, and metric relevance, enabling differentiated strategies such as resource limits, priority queues, and kill policies.
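
One way to picture the differentiated strategies is a policy table keyed by baseline. The baseline names come from the article; the queue names, priority values, and the choice to make only the default baseline killable are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    queue: str       # resource queue the tasks run in (hypothetical names)
    priority: int    # higher runs first
    killable: bool   # may be killed to free resources


POLICIES = {
    "02:30": Policy("core", 4, False),
    "04:30": Policy("core", 3, False),
    "07:30": Policy("normal", 2, False),
    "09:30": Policy("normal", 1, False),
    "default": Policy("batch", 0, True),
}


def schedule_order(tasks):
    """tasks maps task name to baseline; return names sorted by baseline priority."""
    return sorted(tasks, key=lambda name: -POLICIES[tasks[name]].priority)


def kill_candidates(tasks):
    """Tasks that may be killed when resources run short."""
    return [name for name, baseline in tasks.items() if POLICIES[baseline].killable]


tasks = {"ads_report": "default", "dm_sales": "02:30", "dws_user": "07:30"}
order = schedule_order(tasks)
victims = kill_candidates(tasks)
```

Keeping the policy in one table means a task changes behaviour simply by moving to a different baseline.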

Alarm reduction measures include:

Consolidating multiple alerts into a single baseline‑based alarm.

Smart cancellation of phone alerts within 20 minutes.

Adjustable alarm intervals and counts.

Aggregating repeated alarms into the next scheduled alarm.
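
The consolidation idea above can be sketched as grouping task-level alerts by baseline and dropping repeats, so that one message is sent per baseline. The alert tuple layout and the message wording are hypothetical.

```python
from collections import defaultdict


def consolidate(alerts):
    """alerts is a list of (baseline, task, reason) tuples.

    Returns one message per baseline, listing each affected task once.
    """
    grouped = defaultdict(list)
    for baseline, task, _reason in alerts:
        if task not in grouped[baseline]:   # repeated alarms collapse into one entry
            grouped[baseline].append(task)
    return {
        baseline: f"baseline {baseline}: {len(tasks)} task(s) at risk: {', '.join(tasks)}"
        for baseline, tasks in grouped.items()
    }


alerts = [
    ("02:30", "dm_sales", "timeout"),
    ("02:30", "dws_order", "failed"),
    ("02:30", "dm_sales", "timeout"),   # repeat of the first alert
    ("07:30", "dws_user", "failed"),
]
messages = consolidate(alerts)
```

Four raw alerts become two messages here, which is the kind of reduction that matters most for night-time phone alerts.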

2.3 Key‑Link Diagnosis – Calculates the longest‑running task chain on a baseline every 10 minutes, compares recent execution curves with historical averages, and pinpoints the exact upstream task causing delay.
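
A minimal sketch of the diagnosis, assuming the longest-running chain is the dependency path with the largest total run time and that the culprit is the task on that path with the biggest slowdown against its historical average. The function names and sample figures are hypothetical, and the 10-minute scheduling loop is left out.

```python
def key_link(upstreams, duration):
    """Return (total_minutes, chain) for the longest-running dependency chain.

    upstreams maps a task to its direct upstream tasks;
    duration maps every task to its run time in minutes.
    """
    memo = {}

    def best(task):
        if task not in memo:
            candidates = [best(u) for u in upstreams.get(task, [])]
            total, chain = max(candidates, key=lambda p: p[0], default=(0, []))
            memo[task] = (total + duration[task], chain + [task])
        return memo[task]

    return max((best(t) for t in duration), key=lambda p: p[0])


def slowest_vs_history(chain, duration, history_avg):
    """Task on the chain whose run time grew most against its historical average."""
    return max(chain, key=lambda t: duration[t] - history_avg[t])


upstreams = {
    "dwd_order": ["rdb_order"],
    "dwd_user": ["rdb_user"],
    "dws_order": ["dwd_order", "dwd_user"],
    "dm_sales": ["dws_order"],
}
duration = {"rdb_order": 10, "rdb_user": 5, "dwd_order": 40,
            "dwd_user": 15, "dws_order": 20, "dm_sales": 10}
history_avg = {"rdb_order": 10, "rdb_user": 5, "dwd_order": 15,
               "dwd_user": 15, "dws_order": 18, "dm_sales": 10}

total, chain = key_link(upstreams, duration)
culprit = slowest_vs_history(chain, duration, history_avg)
```

In this sample the chain through `dwd_order` dominates, and `dwd_order` itself is flagged because it ran 25 minutes longer than usual.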

2.4 Impact Assessment – After a failure, the system lists affected downstream tasks, models, and services. Future work aims to drill down to metric‑level impact.

3. Post‑operation – Intervention Measures and Normalisation

Intervention actions include killing default‑baseline tasks, setting “do‑not‑disturb” windows, key‑link diagnosis, dynamic queue balancing, and a knowledge‑base of incident resolutions.

3.2 Rapid Incident Recovery – Introduces a “freeze pool” concept: freeze the root tasks and their downstream running tasks, then thaw and replay them with controlled parallelism, avoiding duplicate executions and resource waste.
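
The freeze-and-replay idea can be sketched in two steps: collect the root tasks and everything downstream of them into a pool, then replay the pool in dependency order with a cap on how many tasks run at once. This is an illustration under assumptions, not the team's implementation; `freeze_pool`, `replay_batches`, and the task names are hypothetical.

```python
from collections import deque


def freeze_pool(downstreams, roots):
    """Collect the root tasks plus every task downstream of them."""
    frozen = set(roots)
    queue = deque(roots)
    while queue:
        for child in downstreams.get(queue.popleft(), []):
            if child not in frozen:
                frozen.add(child)
                queue.append(child)
    return frozen


def replay_batches(downstreams, frozen, max_parallel):
    """Thaw frozen tasks in dependency order, at most max_parallel per batch.

    Each task appears exactly once, so nothing is executed twice.
    """
    indegree = {task: 0 for task in frozen}
    for parent in frozen:
        for child in downstreams.get(parent, []):
            if child in frozen:
                indegree[child] += 1
    ready = sorted(task for task, degree in indegree.items() if degree == 0)
    batches = []
    while ready:
        batch, ready = ready[:max_parallel], ready[max_parallel:]
        batches.append(batch)
        for task in batch:
            for child in downstreams.get(task, []):
                if child in frozen:
                    indegree[child] -= 1
                    if indegree[child] == 0:
                        ready.append(child)
    return batches


downstreams = {
    "rdb_order": ["dwd_order", "dwd_refund"],
    "dwd_order": ["dws_order"],
    "dwd_refund": ["dws_order"],
    "dws_order": ["dm_sales"],
    "rdb_user": ["dwd_user"],   # unrelated branch, stays untouched
}
frozen = freeze_pool(downstreams, ["rdb_order"])
batches = replay_batches(downstreams, frozen, max_parallel=2)
```

A task enters a batch only after all of its frozen upstream tasks have been replayed, which is what prevents the duplicate runs that an uncoordinated rerun would cause.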

3.3 Normalised Practices – Includes cold‑task grading and automated decommission, time‑consuming task optimisation, engine migration (Hive → Spark), long‑chain task splitting, dimension‑table redesign, and alarm fallback optimisation.

3.4 Monitoring & Retrospective – Defines metrics such as phone‑alarm count, effective response count, response rate, and average response time. Dashboards in the Task Operation Center and weekly BI reports provide visibility, while regular post‑mortems drive continuous improvement.
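
The article names these metrics without giving formulas. A plausible reading, assumed here, is that the response rate is the number of effective responses divided by the number of phone alarms, and the average response time is taken over the alarms that were answered. The input format and `alarm_metrics` are hypothetical.

```python
def alarm_metrics(response_minutes):
    """response_minutes holds one entry per phone alarm:
    minutes until someone responded, or None if nobody did.
    """
    answered = [m for m in response_minutes if m is not None]
    total = len(response_minutes)
    return {
        "phone_alarm_count": total,
        "effective_response_count": len(answered),
        "response_rate": len(answered) / total if total else 0.0,
        "avg_response_minutes": sum(answered) / len(answered) if answered else None,
    }


metrics = alarm_metrics([5, None, 15, 10])
```

Tracking the same four numbers week over week makes it easy to see whether alarm tuning is reducing noise without slowing down the response.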

Future Thoughts

Multi‑link diagnostics and ranking.

Joint baseline‑task issue localisation.

Metric‑level impact analysis.

Further alarm configuration optimisation.

Comprehensive data‑quality evaluation framework.

The author, Jingyuan, is a senior data‑development engineer at NetEase Yanxuan, responsible for supply‑chain and finance domain architecture, with extensive experience in data‑warehouse construction and dimensional modelling.

Recruitment notice: NetEase Yanxuan’s data team is hiring senior big‑data engineers for e‑commerce warehouse development, ETL, and data‑standard governance.
