Optimization Practices for Real-Time Data Warehouse Governance at NetEase Cloud Music
This article details the current challenges, governance motivations, architectural design, and technical optimizations (Flink SQL tuning, Kafka batch improvements, partitioned stream tables, containerization, and automated governance) implemented to enhance the efficiency, stability, and cost-effectiveness of NetEase Cloud Music's real-time data warehouse platform.
The NetEase Cloud Music data warehouse platform has been in production for over six years, serving more than 700 users and handling over 1,600 real‑time and 7,000–8,000 offline tasks daily, with a compute cluster of 2,000+ machines processing petabytes of logs.
The platform aims to bridge technology and business by providing a customized, business‑centric data platform that integrates group‑level services (e.g., Flink‑based real‑time task development, the "Mammoth" offline platform, metadata management, and Ranger‑based security) while adding bespoke components to meet internal workflow and cost‑control requirements.
Governance is driven by cost‑reduction pressure, high Kafka watermarks, a three‑fold traffic increase from new tracking data, and the fact that most users are non‑professional data developers, leading to frequent performance and configuration issues.
Governance planning consists of four parts: (1) understanding the current state via real‑time resource monitoring (Smildon) and virtual department queues; (2) "movement‑style" governance to manually clean up legacy tasks; (3) technical optimizations to reduce resource usage and improve stability; and (4) establishing sustainable, automated governance mechanisms.
Technical optimizations include:
Flink SQL enhancements such as pre‑deserialization keyword filtering, asynchronous dimension‑table joins, and separating Kafka read concurrency from downstream processing using rescale/rebalance.
Kafka batch improvements by refining monitoring, balancing partition loads, and adopting the Sticky Partitioner to increase batch size and lower cluster watermarks from 80% to 30%.
Designing partitioned stream tables inspired by Hive partitioning, enabling automatic topic routing and partition pruning to dramatically cut unnecessary data consumption.
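The pre-deserialization keyword filtering mentioned above can be sketched as follows. This is a minimal illustration of the idea, not the platform's actual implementation: a cheap substring scan over the raw, undecoded bytes rejects records that cannot possibly match the query's filter, so the expensive JSON decode only runs on candidates. The keyword and record shapes are hypothetical.

```python
import json

# Illustrative keyword; a real job would derive this from the SQL WHERE clause
REQUIRED_KEYWORD = b'"action":"play"'

def maybe_deserialize(raw: bytes):
    """Skip the expensive JSON decode when the raw bytes cannot match.

    A substring scan over the undecoded payload is far cheaper than
    deserialization, so most irrelevant records are dropped early.
    """
    if REQUIRED_KEYWORD not in raw:
        return None  # filtered out without deserializing
    return json.loads(raw)

records = [
    b'{"action":"play","song":"s1"}',
    b'{"action":"pause","song":"s2"}',
]
kept = [r for r in (maybe_deserialize(x) for x in records) if r is not None]
```

In high-volume log topics where only a small fraction of records match a job's predicate, skipping deserialization for the rest is often the single largest CPU saving.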
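The Sticky Partitioner behavior described above can be illustrated with a small simulation (a sketch of the strategy, not Kafka's producer code): keyless records stick to one partition until the current batch fills, then switch, so batches are fewer and fuller than under per-record round-robin.

```python
import random

class StickyPartitioner:
    """Sticks to one randomly chosen partition until the current batch
    fills, then switches partitions. Compared with round-robin, this
    yields larger producer batches and fewer requests per broker."""

    def __init__(self, num_partitions: int, batch_size: int):
        self.num_partitions = num_partitions
        self.batch_size = batch_size
        self._current = random.randrange(num_partitions)
        self._in_batch = 0

    def partition(self) -> int:
        if self._in_batch >= self.batch_size:
            # Batch is full: pick a new sticky partition and start over
            self._current = random.randrange(self.num_partitions)
            self._in_batch = 0
        self._in_batch += 1
        return self._current

p = StickyPartitioner(num_partitions=8, batch_size=100)
assignments = [p.partition() for _ in range(300)]
# 300 keyless records form 3 full batches instead of being spread
# one record at a time across all 8 partitions
```

Larger batches mean fewer produce requests and better compression, which is what allowed the cluster watermarks to drop so sharply.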
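The partition-pruning idea behind partitioned stream tables can be sketched like this. All names here are hypothetical: a registry maps each partition value of a logical stream table to the physical Kafka topic backing it, and a query that filters on the partition column subscribes only to the matching topics, much as Hive partition pruning skips directories.

```python
# Hypothetical registry: partition value -> backing Kafka topic
STREAM_TABLE_PARTITIONS = {
    "os=android": "user_log_android",
    "os=ios": "user_log_ios",
    "os=web": "user_log_web",
}

def prune_topics(predicate_values: set) -> list:
    """Return only the topics a consumer must subscribe to, given the
    partition values referenced by the query's WHERE clause."""
    return sorted(
        topic
        for part, topic in STREAM_TABLE_PARTITIONS.items()
        if part.split("=", 1)[1] in predicate_values
    )

# A job filtering WHERE os = 'ios' consumes one topic instead of three
topics = prune_topics({"ios"})
```

The consumer never reads, deserializes, or filters the pruned partitions' data at all, which is where the dramatic reduction in unnecessary consumption comes from.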
Future plans focus on containerizing the data platform with Kubernetes for fine‑grained resource isolation, precise CPU allocation, macro‑level monitoring, and flexible scheduling, as well as building an automated governance platform that leverages metadata to enforce rules, perform pre‑deployment checks, and continuously scan for violations.
The Q&A section confirms that the partitioned stream table supports unified batch‑stream modeling via a data‑model layer and that a low‑code tool (FastX) can generate consistent DSL for both Flink and Spark, enabling a single logic to run in real‑time and offline environments.
Overall, the article presents a comprehensive roadmap for optimizing real‑time data warehouse operations, balancing performance, cost, and maintainability.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.