NetEase Game Streaming ETL Architecture and Practices Based on Flink
This article presents NetEase Game's streaming ETL solution built on Flink, covering business background, log characteristics, specialized and generic ETL services, architectural evolution, Python UDF integration, runtime optimizations, fault‑tolerance mechanisms, and future roadmap for unified real‑time and offline data warehouses.
NetEase Game's data integration relies on a streaming ETL pipeline powered by Flink, transforming heterogeneous game logs—operational, business, and program logs—into structured data for both real‑time and offline warehouses.
The system handles challenges such as schema‑free sources (e.g., MongoDB), deeply nested fields, high log variety, and frequent schema changes, requiring flexible parsing, transformation, and robust error handling.
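Deeply nested fields are typically projected into flat column names before landing in the warehouse. A minimal sketch of that flattening step (the `flatten` helper and dot-separated naming are illustrative, not NetEase's actual code):

```python
from typing import Any, Dict

def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dot-separated column names."""
    out: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out

log = {"event": "login", "player": {"id": 42, "device": {"os": "android"}}}
print(flatten(log))
# {'event': 'login', 'player.id': 42, 'player.device.os': 'android'}
```

With schema-free sources such as MongoDB, running this per record (rather than against a fixed schema) is what lets the pipeline absorb new nested fields without redeployment.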
Three ETL services are offered: a dedicated operational‑log ETL with custom logic, the generic EntryX ETL for all other text logs, and ad‑hoc jobs for special cases. EntryX defines Source, StreamingTable, and Sink modules, automatically generating Flink jobs from user configurations.
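To make the Source/StreamingTable/Sink split concrete, a hypothetical EntryX job configuration might look like the following (every field name and path here is illustrative; the real configuration format is not published in this article):

```yaml
# Hypothetical EntryX job configuration (all names illustrative)
source:
  type: kafka
  topic: game-ops-log
streaming_tables:
  - name: player_login
    filter: "event == 'login'"
    fields:
      - {name: player_id, path: player.id, type: bigint}
sinks:
  - {type: kafka, target: rt-warehouse}                  # real-time warehouse
  - {type: hdfs,  target: /warehouse/ods/player_login}   # offline Hive partition
```

From such a declaration, EntryX generates the corresponding Flink job, so users describe *what* to extract rather than writing DataStream code.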
Architectural evolution progressed from Hadoop Streaming (Python scripts) to Spark Streaming (POC) and finally to Flink DataStream, preserving Python UDFs via a Jython‑based Runner layer that executes cross‑language functions within the JVM.
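The Runner layer's contract can be modeled as a registry that looks up user functions by name and invokes them per record. The sketch below simulates that contract in plain Python; in the real system a Jython interpreter embedded in the JVM plays this role, and the class and method names here are illustrative:

```python
from typing import Any, Callable, Dict

class UdfRunner:
    """Toy model of the Runner layer: user-supplied Python functions are
    registered by name and invoked on each record inside the streaming job."""

    def __init__(self) -> None:
        self._udfs: Dict[str, Callable[[dict], dict]] = {}

    def register(self, name: str, fn: Callable[[dict], dict]) -> None:
        self._udfs[name] = fn

    def apply(self, name: str, record: dict) -> dict:
        return self._udfs[name](record)

runner = UdfRunner()
# Example UDF: anonymize the last IP octet before the record is sunk.
runner.register("mask_ip", lambda r: {**r, "ip": r["ip"].rsplit(".", 1)[0] + ".0"})
print(runner.apply("mask_ip", {"player": 1, "ip": "10.2.3.4"}))
# {'player': 1, 'ip': '10.2.3.0'}
```

Keeping this indirection layer is what allowed the legacy Hadoop Streaming Python scripts to survive the migration to Flink unchanged.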
Runtime optimizations include hot‑reloading lightweight configuration changes without a job restart, consolidating multiple stream tables into a single Flink job to avoid redundant Kafka reads, and separating real‑time and offline sinks so HDFS back‑pressure cannot stall the real‑time path.
Further performance tuning addresses HDFS small‑file explosion by pre‑partitioning streams (keyBy) and limiting parallelism, while SLA metrics are collected via OperatorState‑based utilities supporting static, dynamic, and TTL metrics.
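The small-file mitigation hinges on routing all records of one log type to a fixed, bounded set of writer subtasks, so the number of open HDFS files per partition interval is capped by the writer parallelism rather than growing with log variety. A sketch of such a routing key (the hash choice and writer count are illustrative):

```python
import hashlib

def writer_index(log_name: str, num_writers: int = 8) -> int:
    """Deterministically route a log type to one of `num_writers` writer
    subtasks, mirroring a keyBy on the log name before the HDFS sink."""
    digest = hashlib.md5(log_name.encode()).hexdigest()
    return int(digest, 16) % num_writers

# Same log type always lands on the same writer, bounding open-file count.
assert writer_index("player_login") == writer_index("player_login")
```

Any stable hash works here; the essential property is determinism, so one log type never fans out across every sink subtask.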
Fault tolerance is achieved using SideOutput for error streams, with downstream recovery strategies involving batch reprocessing or targeted back‑fill jobs that replace corrupted Hive partitions.
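The SideOutput pattern splits each batch of input into a main stream of successfully parsed rows and a side stream of raw records that failed parsing, which is what later batch reprocessing or back-fill jobs consume. A minimal single-process analogue (function names are illustrative):

```python
import json

def process(records, parse):
    """Split input into parsed rows (main output) and unparseable raw
    records (side output), analogous to Flink's SideOutput mechanism."""
    main, side = [], []
    for raw in records:
        try:
            main.append(parse(raw))
        except Exception:
            side.append(raw)  # preserved verbatim for reprocessing / back-fill
    return main, side

good, bad = process(['{"a": 1}', 'not-json'], json.loads)
print(good, bad)  # [{'a': 1}] ['not-json']
```

Because the side output keeps the original bytes rather than a partial parse, a targeted back-fill job can later re-run a fixed parser over just the failed records and replace the affected Hive partitions.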
Future plans focus on data‑lake support for update/delete workloads, automatic small‑file merging and deduplication, and extending Python support to the full Flink stack via PyFlink.