Building and Evolving NetEase Yanxuan Real-Time Computing Platform: Architecture, SQLization, Serviceization, and Data Governance
This article details NetEase Yanxuan's real-time computing platform development from 2017 to present, covering its architecture, Flink‑SQL development environment, service‑oriented deployment, resource optimization, cloud‑native migration, comprehensive data governance, and future plans for stream‑batch integration and intelligent job diagnostics.
NetEase Yanxuan, a brand e‑commerce platform, requires highly reliable real‑time data processing for scenarios such as real‑time data warehouses, risk control, and business monitoring.
Background and Evolution – Starting in 2017, the platform began its journey toward a unified real‑time system. Key milestones include the launch of Streaming SQL in 2018, Flink‑on‑K8s serviceization in 2019, a focus on governance in 2020, and recent explorations of batch‑stream convergence.
Current Platform Status – Over 5,000 tasks run daily, with peak throughput of about 20 million events per second and end-to-end latency on the order of seconds. The platform supports dashboards, risk algorithms, log monitoring, and APM alerts.
Architecture – The stack consists of an infrastructure layer (Kafka, Pulsar, Yarn/K8s, storage), a service abstraction layer that hides Flink‑task details behind REST/RPC APIs, and a platform layer offering development, operations, monitoring, metadata management, and lineage tools.
SQL‑Based Real‑Time Tasks – To lower the development barrier, the Atom IDE provides an out‑of‑the‑box Flink‑SQL environment with unified metadata, a UDF repository, and extensions such as connectors (MySQL, PostgreSQL, TiDB, ES), dimension‑table caching, window triggers, and enhanced DDL.
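Dimension-table caching can be illustrated with a small sketch. The article does not describe the cache policy Atom uses, so the LRU-plus-TTL design below, the class name `DimensionCache`, and the `loader` callback (standing in for a MySQL or TiDB lookup) are all assumptions made for illustration.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable


class DimensionCache:
    """LRU cache with a per-entry TTL placed in front of a dimension-table lookup.

    Hypothetical sketch: `loader` stands in for a query against the dimension store.
    """

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        max_size: int = 10_000,
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[Hashable, tuple]" = OrderedDict()
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            # Fresh entry: mark it as most recently used and return it.
            self._entries.move_to_end(key)
            return hit[1]
        # Missing or expired: reload from the dimension store.
        self.misses += 1
        row = self._loader(key)
        self._entries[key] = (now + self._ttl, row)
        self._entries.move_to_end(key)
        # Evict least recently used entries beyond the size limit.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)
        return row
```

The TTL bounds how stale a joined dimension row can be, while the size limit bounds memory per task; both values would need tuning per table.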
Task Submission and Debugging – SQL is compiled into a JobGraph and submitted to the cluster, with a debug mode that rewrites SQL, samples online data, and streams results via WebSocket or file output for rapid iteration.
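The debug-mode rewrite can be pictured as a transformation over table definitions before the job is compiled. The sketch below is only a guess at the shape of that step: `TableDef`, the `debug.sample-rows` option, and the `debug-file` connector name are invented for illustration and are not Flink or Atom options.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class TableDef:
    name: str
    role: str  # "source" or "sink"
    options: Dict[str, str] = field(default_factory=dict)


def rewrite_for_debug(
    tables: List[TableDef],
    sample_rows: int = 100,
    output_dir: str = "/tmp/debug",
) -> List[TableDef]:
    """Return copies of the table definitions adjusted for a debug run."""
    rewritten = []
    for table in tables:
        if table.role == "source":
            # Keep the online source but cap how many rows are sampled from it.
            opts = dict(table.options)
            opts["debug.sample-rows"] = str(sample_rows)  # hypothetical option
        elif table.role == "sink":
            # Redirect output to a local file instead of the production sink.
            opts = {
                "connector": "debug-file",  # hypothetical connector name
                "path": f"{output_dir}/{table.name}.jsonl",
            }
        else:
            raise ValueError(f"unknown role: {table.role}")
        rewritten.append(replace(table, options=opts))
    return rewritten
```

Because the rewrite returns new definitions rather than mutating the originals, the production job description stays untouched while the debug run iterates.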
Flink Serviceization – Tasks are managed through a service layer exposing start/stop, status, checkpoint, and savepoint operations via REST/RPC, supporting multiple Flink versions, pluggable Yarn/K8s clusters, and automatic failure recovery.
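As a rough illustration of what such a service layer tracks, the following in-memory sketch models start, stop, status, savepoint, and recovery as a small state machine. It does not call Flink; the class `JobService`, the state names, and the savepoint naming scheme are assumptions, not the platform's actual API.

```python
from enum import Enum
from typing import Dict, List, Optional


class JobState(Enum):
    CREATED = "CREATED"
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"


class JobService:
    """In-memory stand-in for a job-management service (hypothetical)."""

    def __init__(self) -> None:
        self._state: Dict[str, JobState] = {}
        self._savepoints: Dict[str, List[str]] = {}
        self.restored_from: Dict[str, Optional[str]] = {}

    def register(self, job_id: str) -> None:
        if job_id in self._state:
            raise ValueError(f"job already registered: {job_id}")
        self._state[job_id] = JobState.CREATED
        self._savepoints[job_id] = []
        self.restored_from[job_id] = None

    def status(self, job_id: str) -> JobState:
        return self._state[job_id]

    def start(self, job_id: str, from_savepoint: Optional[str] = None) -> None:
        if self.status(job_id) is JobState.RUNNING:
            raise RuntimeError(f"job already running: {job_id}")
        self.restored_from[job_id] = from_savepoint
        self._state[job_id] = JobState.RUNNING

    def savepoint(self, job_id: str) -> str:
        if self.status(job_id) is not JobState.RUNNING:
            raise RuntimeError(f"job not running: {job_id}")
        path = f"savepoint-{job_id}-{len(self._savepoints[job_id]) + 1}"
        self._savepoints[job_id].append(path)
        return path

    def stop(self, job_id: str) -> str:
        # Take a savepoint first so a later start can resume from it.
        path = self.savepoint(job_id)
        self._state[job_id] = JobState.STOPPED
        return path

    def on_failure(self, job_id: str) -> Optional[str]:
        # Automatic recovery: restart from the latest savepoint, if one exists.
        saved = self._savepoints[job_id]
        latest = saved[-1] if saved else None
        self._state[job_id] = JobState.STOPPED
        self.start(job_id, from_savepoint=latest)
        return latest
```

Keeping the lifecycle rules in one place is what lets a REST or RPC front end stay thin regardless of which Flink version or cluster type runs the job.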
Resource Optimization – The platform moves from per‑Job isolation to a session‑based model enhanced with resource‑strategy pools, achieving a balance between isolation and high resource utilization.
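One way to picture resource-strategy pools is as groups of shared sessions, where each strategy fixes the session size and jobs are packed into sessions that still have free slots. The article does not give the placement rule, so the first-fit policy and the names `Session` and `StrategyPool` below are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Session:
    session_id: str
    total_slots: int
    jobs: Dict[str, int] = field(default_factory=dict)  # job_id -> slots used

    @property
    def free_slots(self) -> int:
        return self.total_slots - sum(self.jobs.values())


class StrategyPool:
    """Sessions that share one resource strategy (hypothetical sketch)."""

    def __init__(self, strategy: str, slots_per_session: int) -> None:
        self.strategy = strategy
        self.slots_per_session = slots_per_session
        self.sessions: List[Session] = []

    def place(self, job_id: str, slots: int) -> Session:
        if slots > self.slots_per_session:
            raise ValueError("job does not fit in a session of this strategy")
        # First fit: reuse an existing session with enough free slots.
        for session in self.sessions:
            if session.free_slots >= slots:
                session.jobs[job_id] = slots
                return session
        # Otherwise open a new session in the pool.
        session = Session(
            f"{self.strategy}-{len(self.sessions) + 1}", self.slots_per_session
        )
        session.jobs[job_id] = slots
        self.sessions.append(session)
        return session
```

Jobs in different pools never share a session, which preserves some isolation, while jobs inside one pool share sessions and so raise utilization.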
Cloud‑Native Deployment – Migration from Yarn to Kubernetes provides true cgroup isolation, rapid scaling, node‑selector scheduling, ingress‑exposed REST services, HA via ZooKeeper, sidecar‑based logging, and Service‑Mesh integration.
Data Governance – Governance combines three parts: monitoring that stores operator‑granularity metrics in OpenTSDB, unified data lineage derived from SQL parse trees or JobGraph DAGs, and full‑chain governance that classifies tables as hot or cold and tunes task resources based on intelligent diagnostics.
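To make the lineage idea concrete, the sketch below extracts source-to-sink edges from `INSERT INTO ... SELECT` statements. It swaps the parse-tree walk described in the article for simple regular expressions, so it is only an approximation: it would mis-handle constructs such as `EXTRACT(YEAR FROM ts)`. The function name `extract_edges` and the table names in the usage example are hypothetical.

```python
import re
from typing import Set, Tuple

_SINK = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
_SOURCE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(sql: str) -> Set[Tuple[str, str]]:
    """Return (source_table, sink_table) pairs found in a SQL script."""
    edges: Set[Tuple[str, str]] = set()
    for statement in sql.split(";"):
        sink = _SINK.search(statement)
        if sink is None:
            continue  # not an INSERT statement, so it contributes no edge
        for source in _SOURCE.findall(statement):
            edges.add((source, sink.group(1)))
    return edges


script = """
INSERT INTO dws_orders
SELECT o.id, u.name FROM ods_orders o JOIN dim_users u ON o.uid = u.id;
INSERT INTO ads_report SELECT * FROM dws_orders
"""
edges = extract_edges(script)
```

Merging the edges from every task yields a table-level graph, which is the kind of structure a hot/cold classification could then be computed over.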
Future Plans – The roadmap includes stream‑batch integration with Iceberg data lake support, and an intelligent job‑diagnosis service that automatically applies optimization recommendations.
Q&A Highlights – The Q&A covered Flink sink idempotency (retract/upsert modes) and the high demand for the debugging feature, which replaces sources with sampled Kafka data and sinks with WebSocket or file outputs.
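The idempotency point can be shown with a toy comparison, assuming a keyed upsert sink and a plain append sink that each receive the same batch twice, as would happen when records are replayed after a recovery. The function names and the `order_id` key are illustrative only, and retract mode is not modeled here.

```python
from typing import Dict, List

Record = Dict[str, object]


def upsert_sink(table: Dict[object, Record], batch: List[Record], key: str) -> None:
    # Writing by primary key overwrites any earlier copy of the same record.
    for record in batch:
        table[record[key]] = record


def append_sink(log: List[Record], batch: List[Record]) -> None:
    # Appending keeps every delivery, including replayed duplicates.
    log.extend(batch)


batch = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 25}]
upserted: Dict[object, Record] = {}
appended: List[Record] = []

for _ in range(2):  # the second pass simulates a replay after recovery
    upsert_sink(upserted, batch, key="order_id")
    append_sink(appended, batch)
```

After the replay, the upsert target still holds one row per key, while the append target holds each record twice.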
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.