Apache Flink Practice at NetEase: Architecture, Scale, and Future Directions
This article traces NetEase's evolution from Storm to Flink for real‑time computing. It describes the Sloth platform's architecture, its large‑scale deployment, the business scenarios it serves, and its monitoring, alerting, and future development plans, illustrating how Flink powers data synchronization, real‑time data warehousing, and e‑commerce analytics and recommendation.
NetEase originally used Storm for real‑time tasks such as email anti‑spam, advertising, and news recommendation, but has been migrating many jobs to Flink.
Flink was chosen for its high throughput, low latency, checkpointing, exactly‑once semantics, and event‑time support, leading to the creation of the Sloth project in 2017, a SQL‑based real‑time platform built on Apache Flink.
After initial challenges with platformization and heavy code modifications, Sloth was rebuilt in early 2019 using Flink 1.7 and Blink SQL, allowing users to submit SQL jobs or write Java code directly.
To consolidate resources, NetEase integrated various business units (e.g., NetEase Cloud Music, Yanxuan, Media) into a unified real‑time platform, with the research institute handling the underlying platform and APIs while business teams focus on their logic.
Currently, the platform runs over a thousand tasks, utilizes more than 20,000 vCores and 80 TB of memory, and supports scenarios such as advertising, e‑commerce dashboards, ETL, data analysis, recommendation, risk control, search, and live streaming.
The platform architecture evolved through three major versions:
Sloth 0.x: a custom SQL implementation built when Flink SQL support was still limited, tightly coupled with the offline platform.
Sloth 1.0: introduced plugin‑based Flink management, parent‑child process model, support for jar tasks, Blink SQL integration, Grafana monitoring, and a custom time‑series database (Ntsdb).
Sloth 2.0: transformed into a distributed PaaS with API‑driven front‑ends, multi‑version Flink support (1.5, 1.7, 1.9, Blink), and high availability via Nginx load balancing.
The platform consists of several modules: Sloth‑Server (handles user requests, validation, and task submission), Sloth‑Kernel (executes submission scripts on the cluster), Sloth‑Admin (confirms execution and monitors YARN), plus supporting components such as HDFS, Nginx, Zookeeper, Kafka, Elasticsearch, and the custom time‑series database Ntsdb.
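The division of labor among the three modules can be sketched as a simple pipeline. This is an illustrative stand‑in, not the platform's real API: the function names and return values are assumptions, and in production each step is a separate service rather than a callable.

```python
def submit_task(task, validate, launch, confirm):
    """Illustrative Sloth submission flow (hypothetical stand-ins, not
    real Sloth APIs): Server validates the request, Kernel launches it
    on the cluster, Admin confirms the YARN application is running."""
    if not validate(task):          # Sloth-Server: request validation
        return "rejected"
    app_id = launch(task)           # Sloth-Kernel: run the submit script
    return "running" if confirm(app_id) else "failed"  # Sloth-Admin
```

For example, wiring in trivial stand‑ins for the three services shows the happy path returning `"running"` and an invalid request being rejected before anything reaches the cluster.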
Event management ensures exclusive task operations using a distributed lock coordinated among Server, Kernel, and Admin, with high‑availability achieved through horizontal scaling and hot‑standby mechanisms.
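The exclusivity guarantee above can be sketched with a minimal lock registry. This is an in‑memory simulation for clarity; the class name and API are assumptions, and the real platform coordinates the lock through a shared store (Zookeeper is among the listed components) rather than a local object.

```python
import threading

class TaskLockRegistry:
    """In-memory stand-in for the distributed lock shared by Server,
    Kernel, and Admin; names and API are illustrative, not Sloth's."""

    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}  # task_id -> component currently holding it

    def acquire(self, task_id, component):
        # Only one component may operate on a given task at a time.
        with self._guard:
            if task_id in self._owners:
                return False
            self._owners[task_id] = component
            return True

    def release(self, task_id, component):
        # Only the current holder may release the lock.
        with self._guard:
            if self._owners.get(task_id) == component:
                del self._owners[task_id]
                return True
            return False
```

With this shape, a second component's `acquire` on the same task fails until the holder releases, which is exactly the exclusive‑operation property the platform needs.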
Kernel scheduling relies on a parent‑child process architecture, supporting both resident and temporary processes for task execution and SQL debugging.
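The resident/temporary split can be illustrated with plain OS processes. This is a sketch only: the real Kernel forks version‑specific Flink client processes, not `python -c`, and the function names are assumptions.

```python
import subprocess
import sys

def run_temporary_child(sql_text):
    """Temporary child, e.g. for SQL debugging or syntax checking:
    launched per request, returns its output, then exits.
    (Illustrative; the real Kernel launches a Flink client process.)"""
    proc = subprocess.run(
        [sys.executable, "-c",
         "import sys; print('checked: ' + sys.stdin.read().strip())"],
        input=sql_text, capture_output=True, text=True, timeout=30,
    )
    return proc.stdout.strip()

def start_resident_child():
    """Resident child: stays alive to serve repeated task submissions
    until the parent terminates it."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time\nwhile True: time.sleep(1)"])
```

The parent-child split isolates the platform process from per‑version Flink clients: a crash in one child does not take down the scheduler, and multiple Flink versions can coexist as separate child binaries.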
Task development UI offers debugging, syntax checking, metadata management, resource file handling, versioning, and more, while Blink SQL extends support for dimension joins and various sinks (HDFS, Kafka, HBase, ES, Ntsdb, Kudu).
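The dimension (lookup) join mentioned above enriches each streamed event with fields fetched by key from a dimension table. A minimal sketch of the semantics, assuming a dict stands in for the external store (in production the lookup would hit HBase, Redis, or similar per key):

```python
def dimension_join(events, dim_table, key):
    """Enrich each streamed event with fields from a dimension table,
    sketching the semantics of a SQL lookup/dimension join.
    Illustrative only: `dim_table` stands in for an external store."""
    for event in events:
        dims = dim_table.get(event[key]) or {}
        # Unmatched keys pass through unenriched (LEFT JOIN behavior).
        yield {**event, **dims}
```

For example, joining a click stream against a product dimension attaches the product's category to every click whose `product_id` is present in the table.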
Monitoring uses InfluxDB and the custom Ntsdb, visualized via Grafana, with alerting based on metrics such as task failures, latency, and custom QPS thresholds, delivering notifications through internal chat, email, SMS, etc.
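The alerting rules described above amount to comparing the latest per‑task metrics against configured thresholds. A hedged sketch, with field and rule names invented for illustration (the platform's actual metric schema is not shown in the talk):

```python
def evaluate_alerts(metrics, rules):
    """Check latest task metrics against per-task alert rules and
    return notifications to dispatch (chat/email/SMS downstream).
    Field names are illustrative, not the platform's schema."""
    alerts = []
    for task, m in metrics.items():
        rule = rules.get(task, {})
        if m.get("failed"):
            alerts.append((task, "task failed"))
        if m.get("latency_ms", 0) > rule.get("max_latency_ms", float("inf")):
            alerts.append((task, "latency above threshold"))
        if m.get("qps", float("inf")) < rule.get("min_qps", 0):
            alerts.append((task, "QPS below threshold"))
    return alerts
```

In practice such a check runs periodically against the time‑series store (Ntsdb/InfluxDB here), and each returned tuple is fanned out to the configured notification channels.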
Key business use cases include:
Real‑time data synchronization for AI dialogue services, writing results to Elasticsearch.
Real‑time data warehouse pipelines: ingest logs to Kafka, process with Flink, store aggregates in Redis and Kudu.
E‑commerce data analysis: real‑time activity, funnel, and profit calculations, feeding Kudu for analyst queries.
E‑commerce search and recommendation: real‑time user and product feature extraction, CTR/CVR estimation, and UV/PV statistics, stored in Redis for online serving.
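As one concrete example of the analytics above, the funnel calculation counts how many distinct users reach each stage after passing through all earlier stages. A simplified in‑memory sketch (the production pipeline keeps this state in Flink and serves results from Kudu or Redis):

```python
def funnel_counts(events, stages):
    """Count distinct users reaching each funnel stage in order.
    `events` is a time-ordered stream of (user, action) pairs;
    a user counts for stage i only after reaching stage i-1.
    A sketch of the real-time funnel metric, not production code."""
    reached = [set() for _ in stages]
    for user, action in events:
        for i, stage in enumerate(stages):
            if action == stage and (i == 0 or user in reached[i - 1]):
                reached[i].add(user)
                break
    return [len(s) for s in reached]
```

For a view → cart → pay funnel, a user who adds to cart without an earlier view event is excluded from the cart stage, which is what makes the per‑stage counts monotonically non‑increasing.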
Future directions focus on supporting Flink on Kubernetes, automatic resource configuration based on workload, intelligent diagnostics for UDFs and job failures, continued enhancements to Flink SQL and batch‑stream convergence, and deeper community involvement.
Speaker: Wu Liangbo, NetEase Java technology expert, responsible for the real‑time computing platform.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.