Real‑time Risk Control Practices at NetEase Games Using Apache Flink
The article details NetEase Games' challenges in payment‑environment risk control and explains how they transformed a T+1 batch workflow into a fully real‑time risk‑control system with Apache Flink, describing the platform architecture, data modeling, session windows, joins, and future development plans.
NetEase Games' core online‑gaming business relies on a stable and reliable payment process; the end‑to‑end flow involves the client, channel (e.g., Alipay), billing system, and game servers, generating many cross‑service requests and massive data that make real‑time risk control challenging.
To monitor and troubleshoot this complex scenario, the concept of a "risk‑control business session" is introduced: a user‑initiated action that spans multiple systems and requests, requiring reconstruction of the entire session for analysis.
Traditional log aggregation (ELK) and nightly Spark jobs (T+1) were insufficient for timely detection. By adopting Apache Flink, NetEase built a zero‑intrusion, cross‑data‑source, real‑time risk‑control engine that processes up to 30 billion events per day.
Key Flink features used include reliable stateful computation with at-least-once or exactly-once guarantees, fault recovery, state TTL, and stream-batch integration. Each business session is modeled as a baseline key (e.g., order ID) plus supplemental data (e.g., SN, product ID); together these form clusters that are joined in real time.
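To make the baseline-plus-supplemental model concrete, here is a minimal plain-Python sketch. It is not Flink code and not NetEase's implementation; the names `Event`, `BusinessSession`, and `SessionStore`, and the assumption that an SN can be mapped back to an order ID once both have appeared on the same event, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    source: str                       # hypothetical source name, e.g. "billing"
    ts_ms: int                        # event time in milliseconds
    order_id: Optional[str] = None    # baseline key
    sn: Optional[str] = None          # supplemental key
    product_id: Optional[str] = None  # supplemental attribute


@dataclass
class BusinessSession:
    order_id: str
    events: List[Event] = field(default_factory=list)
    supplemental: Dict[str, str] = field(default_factory=dict)


class SessionStore:
    """Groups events into sessions keyed by order ID.

    Events that carry only an SN are attached through an SN -> order ID
    mapping learned from earlier events (an assumption of this sketch).
    """

    def __init__(self) -> None:
        self.sessions: Dict[str, BusinessSession] = {}
        self.sn_to_order: Dict[str, str] = {}

    def add(self, event: Event) -> Optional[BusinessSession]:
        order_id = event.order_id
        if order_id is None and event.sn is not None:
            order_id = self.sn_to_order.get(event.sn)
        if order_id is None:
            # No baseline key can be resolved yet; a real job would
            # buffer the event or retry the lookup.
            return None
        session = self.sessions.setdefault(order_id, BusinessSession(order_id))
        session.events.append(event)
        if event.sn is not None:
            self.sn_to_order[event.sn] = order_id
            session.supplemental["sn"] = event.sn
        if event.product_id is not None:
            session.supplemental["product_id"] = event.product_id
        return session
```

In a Flink job the two dictionaries would roughly correspond to keyed state, with state TTL evicting sessions that have gone quiet.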
Event-time session windows with a gap timeout close a session once the user stops interacting, while event-time interval joins handle out-of-order and delayed data across heterogeneous sources. Supplemental data is looked up via async I/O from external stores such as TiDB or Redis.
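The semantics of the two operators can be sketched in plain Python over in-memory lists. This only illustrates the rules, assuming a fixed session gap and inclusive interval bounds (Flink's default for interval joins); it does not model watermarks, state, or the actual Flink API, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int, str]  # (key, event time in ms, payload)


def session_windows(timestamps: List[int], gap_ms: int) -> List[Tuple[int, int]]:
    """Group event timestamps into (first_ts, last_ts) sessions.

    An event joins the current session when it arrives less than gap_ms
    after the session's last event; otherwise it opens a new session.
    """
    windows: List[Tuple[int, int]] = []
    for ts in sorted(timestamps):
        if windows and ts - windows[-1][1] < gap_ms:
            windows[-1] = (windows[-1][0], ts)
        else:
            windows.append((ts, ts))
    return windows


def interval_join(
    left: List[Record], right: List[Record], lower_ms: int, upper_ms: int
) -> List[Tuple[str, str, str]]:
    """Pair records with equal keys where
    left_ts + lower_ms <= right_ts <= left_ts + upper_ms."""
    by_key: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for key, ts, payload in right:
        by_key[key].append((ts, payload))
    joined = []
    for key, ts, payload in left:
        for r_ts, r_payload in by_key.get(key, []):
            if ts + lower_ms <= r_ts <= ts + upper_ms:
                joined.append((key, payload, r_payload))
    return joined
```

A negative lower bound lets the right-hand record arrive slightly before the left-hand one, which is how records from different sources can still be matched when their order is not guaranteed.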
The platform continuously ingests data from multiple sources, normalizes it, applies dynamic rule-based joins via Flink broadcast streams, tags sessions, and writes the results to an HTAP database for SQL-based analysis, funnel charts, heatmaps, and AIGC-enhanced insights.
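The idea behind dynamic rules can be sketched as follows: a control stream delivers rule updates, every worker keeps the current rule set, and each session is evaluated against whatever rules are active when it arrives. This is a plain-Python stand-in for the broadcast-state pattern, not the Flink API. The `Rule` and `RuleTagger` names, the dictionary-shaped session, and the `pay_status` field are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    rule_id: str
    label: str                          # risk label attached on a match
    predicate: Callable[[dict], bool]   # evaluated against a session dict


class RuleTagger:
    """Holds the latest rule set and tags sessions with matching labels."""

    def __init__(self) -> None:
        self.rules: Dict[str, Rule] = {}

    def on_rule_update(self, rule: Rule) -> None:
        # Called for each element of the (broadcast) rule stream;
        # a rule with an existing ID replaces the previous version.
        self.rules[rule.rule_id] = rule

    def on_rule_delete(self, rule_id: str) -> None:
        self.rules.pop(rule_id, None)

    def on_session(self, session: dict) -> List[str]:
        # Called for each element of the session stream.
        return sorted(
            rule.label for rule in self.rules.values() if rule.predicate(session)
        )
```

Because rules arrive as data rather than as code changes, a new rule takes effect for subsequent sessions without restarting the job.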
Real‑time micro‑session queries can retrieve millisecond‑level details, filter by user ID, device, or payment outcome, and present risk labels (e.g., payment failure). The macro view aggregates session knowledge clusters for overall business health monitoring.
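As a rough illustration of the micro and macro views, the sketch below uses an in-memory SQLite database as a stand-in for the HTAP store. The `session_result` table, its columns, and the sample rows are all hypothetical; the real schema is not described in the talk.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of tagged sessions with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE session_result ("
        "session_id TEXT PRIMARY KEY, user_id TEXT, device TEXT, "
        "pay_outcome TEXT, risk_label TEXT, start_ms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO session_result VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("s1", "u1", "ios", "failed", "payment_failure", 1000),
            ("s2", "u1", "ios", "success", "ok", 2000),
            ("s3", "u2", "android", "failed", "payment_failure", 3000),
        ],
    )
    return conn


def micro_query(conn: sqlite3.Connection, user_id: str, pay_outcome: str) -> List[Tuple]:
    """Micro view: individual sessions for one user and payment outcome."""
    cur = conn.execute(
        "SELECT session_id, device, risk_label, start_ms "
        "FROM session_result WHERE user_id = ? AND pay_outcome = ? "
        "ORDER BY start_ms",
        (user_id, pay_outcome),
    )
    return cur.fetchall()


def macro_view(conn: sqlite3.Connection) -> List[Tuple]:
    """Macro view: number of sessions per risk label."""
    cur = conn.execute(
        "SELECT risk_label, COUNT(*) FROM session_result "
        "GROUP BY risk_label ORDER BY risk_label"
    )
    return cur.fetchall()
```

The micro query corresponds to drilling into one user's sessions, while the grouped count is the kind of aggregate a funnel chart or health dashboard would be built on.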
Future directions include supporting ad‑hoc Flink‑SQL queries on risk results, tighter feedback loops for SRE/operations to improve models, and deeper integration of AIGC for automated analysis and recommendation.
Overall, the system enables rapid, automated detection and investigation of payment‑related risks, reducing manual effort from days to minutes.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.