Practical Application of Flink + Kafka at NetEase Cloud Music: Architecture, Platform Design, and Lessons Learned
This article presents a detailed case study of NetEase Cloud Music’s real‑time analytics platform built on Kafka and Flink, covering background, architectural choices, platform‑level design, operational challenges, solutions such as the Magina framework, and a Q&A on reliability and monitoring.
NetEase Cloud Music operates more than 200 Kafka broker nodes across 10+ clusters, handling peak QPS of over 4 million and running 500+ real‑time Flink jobs. The speaker, a senior real‑time computing engineer, outlines the system’s background and the motivations for choosing Kafka as the messaging backbone and Flink as the unified stream‑and‑batch engine.
Kafka was selected for its high throughput, low latency, massive concurrency, fault tolerance, and easy horizontal scaling. Flink was chosen for its high performance, flexible windowing, exactly‑once state semantics, lightweight fault‑tolerance, event‑time handling, and ability to run both streaming and batch workloads.
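Among the Flink features cited, event-time windowing is the one most easily shown in miniature: each record is assigned to a fixed-size window based on its event timestamp rather than its arrival time. The following is a minimal Python sketch of tumbling-window assignment under that idea (illustrative only, not Flink's actual implementation; all names here are hypothetical):

```python
from collections import defaultdict

def window_start(event_ts_ms: int, size_ms: int) -> int:
    """Start of the tumbling window containing event_ts_ms (aligned to window size)."""
    return event_ts_ms - (event_ts_ms % size_ms)

def assign_windows(events, size_ms):
    """Group (timestamp_ms, value) events into tumbling windows keyed by window start."""
    windows = defaultdict(list)
    for ts, value in events:
        windows[window_start(ts, size_ms)].append(value)
    return dict(windows)

# With a 5-second window: 1000 ms and 4999 ms fall in window 0, 5000 ms opens window 5000.
events = [(1000, "a"), (4999, "b"), (5000, "c")]
```

Because assignment depends only on the event timestamp, late-arriving data still lands in the correct window, which is the property that makes event-time processing attractive for log analytics.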
The combined Kafka‑Flink stack forms the core of a platform‑level architecture that ingests logs from client/web sources, processes them in real time, and writes results to various downstream stores. The platform has been refactored into a “Magina” layer that provides a unified SQL/SDK API, catalog management, topic‑as‑table abstraction, and schema handling.
Key platform features include catalog‑level management of Kafka clusters, treating topics as streaming tables, and automatic schema registration. Users can create and maintain Kafka tables in a metadata center, then query them via Flink without dealing with low‑level details.
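The topic-as-table abstraction can be pictured as a small metadata catalog that maps logical table names to Kafka topics and schemas. Below is a hedged Python sketch of that idea; the class names, topic names, and cluster names are all hypothetical, and the real Magina layer is considerably richer:

```python
from dataclasses import dataclass, field

@dataclass
class KafkaTable:
    """A Kafka topic registered as a streaming table (schema maps column -> SQL type)."""
    topic: str
    cluster: str
    schema: dict

@dataclass
class Catalog:
    """Minimal metadata-center catalog: logical table name -> KafkaTable entry."""
    tables: dict = field(default_factory=dict)

    def register(self, name: str, table: KafkaTable) -> None:
        if name in self.tables:
            raise ValueError(f"table {name!r} is already registered")
        self.tables[name] = table

    def lookup(self, name: str) -> KafkaTable:
        return self.tables[name]

# Hypothetical registration: users never touch broker addresses or serializers directly.
catalog = Catalog()
catalog.register("user_play_log", KafkaTable(
    topic="music_user_play_log",
    cluster="kafka-online-01",
    schema={"user_id": "BIGINT", "song_id": "BIGINT", "ts": "TIMESTAMP(3)"},
))
```

A query layer would then resolve `user_play_log` through the catalog and generate the connector configuration, which is what lets users write SQL against topics without handling low-level details.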
Operationally, the team faced challenges such as cluster pressure from massive topics, I/O spikes, duplicate consumption when using multiple sinks, and latency spikes caused by shared network switches. Solutions involved topic‑level data sharding, dynamic routing rules, isolation of compute and storage clusters, and dedicated network paths for real‑time and batch workloads.
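The topic-level sharding mentioned above amounts to deterministically spreading one oversized topic's traffic across several smaller shard topics. A minimal Python sketch of hash-based shard routing, with hypothetical topic names:

```python
import hashlib

def route_shard(key: str, shard_topics: list) -> str:
    """Deterministically map a record key to one of the shard topics.

    Uses a stable hash (md5) rather than Python's built-in hash(), which is
    randomized per process and would break cross-process consistency.
    """
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return shard_topics[digest % len(shard_topics)]

# Hypothetical shard topics replacing a single overloaded "play_log" topic.
shards = ["play_log_shard_0", "play_log_shard_1", "play_log_shard_2"]
```

Because the routing is a pure function of the key, the same user's records always land in the same shard, so per-key ordering within a shard is preserved while broker load is spread out; dynamic routing rules can then be layered on top by changing the shard list per rule.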
A monitoring system was built to surface cluster health, topic statistics, and Flink job metrics (input bandwidth, TPS, latency, lag). This enables rapid diagnosis of abnormal consumption patterns and cluster‑wide issues.
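Of the metrics listed, consumer lag is the most direct signal of abnormal consumption: per partition, it is the broker's log-end offset minus the consumer group's committed offset. A small Python sketch of that computation (thresholds and offsets are illustrative):

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag = broker log-end offset minus the consumer's committed offset.

    A partition missing from committed_offsets is treated as never-consumed (offset 0).
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }

def lag_alert(lag: dict, threshold: int) -> list:
    """Partitions whose lag exceeds the alerting threshold."""
    return sorted(p for p, l in lag.items() if l > threshold)
```

In practice these offsets would come from the Kafka admin/consumer APIs; plotting the lag over time is what makes a stuck or slow Flink job visible within minutes rather than after downstream data goes stale.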
The Q&A section discusses data reliability in real‑time warehouses, learning from production problems, and mechanisms for detecting and handling anomalous Kafka records.
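One common pattern for handling anomalous records, in the spirit of the Q&A discussion, is to validate each record at ingestion and divert failures to a dead-letter output instead of failing the job. The sketch below is an assumption-laden illustration (the required fields and routing are hypothetical, not NetEase's actual mechanism):

```python
import json

REQUIRED_FIELDS = {"user_id", "event", "ts"}  # hypothetical schema for illustration

def validate(raw: bytes):
    """Return (record, None) for a well-formed record, or (None, reason) otherwise."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None, "malformed-json"
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return None, "missing-fields:" + ",".join(sorted(missing))
    return record, None

def split_stream(raw_records):
    """Route good records to the main output; send bad ones to a dead-letter list."""
    good, dead_letter = [], []
    for raw in raw_records:
        record, reason = validate(raw)
        if record is not None:
            good.append(record)
        else:
            dead_letter.append((raw, reason))
    return good, dead_letter
```

Keeping the rejection reason alongside each dead-lettered record makes the downstream diagnosis described in the talk (spotting a misbehaving producer, a schema drift, or an encoding bug) much faster than silently dropping bad data.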
DataFunTalk
DataFunTalk is dedicated to sharing and discussing applications of big data and AI technology, with the goal of empowering a million data scientists. It regularly hosts live tech talks and curates articles on big data, recommendation and search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.