StreamflySQL: NetEase Games’ Journey from Template JAR to SQL Gateway for Flink SQL Platformization
This article traces the evolution of NetEase Games’ Flink SQL platform from the early template‑JAR approach of StreamflySQL v1 to the SQL Gateway architecture of v2. It covers the design decisions behind each version and the challenges encountered along the way: metadata persistence, multi‑tenant security, horizontal scaling, and job state management.
With the maturation of streaming SQL theory, it has become feasible to give real‑time stream computing a SQL development experience comparable to that of offline batch processing. Major platforms such as GCP Dataflow, Apache Flink, Apache Kafka, and Apache Pulsar have added SQL support. Among open‑source solutions, Flink SQL is the most popular, but it lacks a service component comparable to HiveServer2, so each company has had to build its own platform layer. This article introduces NetEase Games’ exploration and practice of Flink SQL platformization.
Development History
NetEase Games’ real‑time computing platform, named Streamfly, originated in 2019 as a subsystem called Lambda under the offline job platform Omega. Lambda supported Storm, Spark Streaming, and Flink, was written in Golang, and used dynamically generated shell scripts to invoke framework CLIs, providing flexibility for multiple Flink versions.
In late 2019, the first attempt, StreamflySQL v1, used a generic template JAR combined with SQL job configuration to run Flink SQL, because the Golang‑based Streamfly could not call Flink client APIs directly.
Due to poor user experience and immature Flink SQL features, StreamflySQL v1 saw limited adoption, prompting a redesign in late 2020 that resulted in StreamflySQL v2.
StreamflySQL v1 (Template JAR Based)
Implementation
StreamflySQL v1 consisted of three modules: a Flink template JAR, a backend configuration center, and a frontend SQL editor. A job was submitted as follows: the frontend requested metadata, the backend forwarded those requests to the distributed metadata services, the user wrote the SQL, the backend created a job through the Lambda API, and the Flink client executed the job in per‑job mode.
Key pain points included:
Slow response: each SQL job required launching a Flink client process, submitting a YARN application, and waiting for container allocation, often taking 1‑2 minutes, or over 5 minutes for complex queries.
Difficult debugging: debugging ran the job with a PrintSink on a local MiniCluster. This kept debug results isolated, but it suffered from long optimization times, could not handle long‑window jobs, and could not deliver results as a stream.
Limited to single DML statements: only insert into ... select ... was supported; DDL, DQL, and DCL statements such as create table, select, grant, or set could not be executed because the environment lifecycle was tied to the Flink client process.
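To make the last restriction concrete, here is a minimal Python sketch of the kind of check a template‑JAR submission path effectively imposes. It is an illustration only: the function names, the keyword table, and the naive semicolon splitting are assumptions made for this example, not StreamflySQL code.

```python
import re
from typing import List

# Hypothetical keyword table; "set" is grouped with grant/revoke to follow
# the grouping used in the text above.
_CATEGORIES = {
    "insert": "DML",
    "create": "DDL",
    "drop": "DDL",
    "alter": "DDL",
    "select": "QUERY",
    "grant": "DCL",
    "revoke": "DCL",
    "set": "DCL",
}


def split_statements(script: str) -> List[str]:
    """Naively split on semicolons (does not handle semicolons in literals)."""
    return [part.strip() for part in script.split(";") if part.strip()]


def classify(statement: str) -> str:
    """Classify a statement by its leading keyword."""
    match = re.match(r"\s*(\w+)", statement)
    keyword = match.group(1).lower() if match else ""
    return _CATEGORIES.get(keyword, "UNKNOWN")


def accepted_by_v1(script: str) -> bool:
    """True only if the script is exactly one `insert into ... select ...`."""
    statements = split_statements(script)
    if len(statements) != 1:
        return False
    pattern = r"(?is)\s*insert\s+into\b.*\bselect\b"
    return re.match(pattern, statements[0]) is not None
```

Under these assumptions, a script that first runs create table and then insert into would be rejected as a whole, which is the limitation v2 set out to remove.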
StreamflySQL v2 (SQL Gateway Based)
Implementation
To overcome v1’s shortcomings, the team adopted Ververica’s Flink SQL Gateway as the core and integrated it into a Spring Boot service. The gateway provides a REST‑based SQL interface similar to Spark Thrift Server, but it required enhancements before it was ready for production use.
Key architectural decisions:
Metadata persistence: session and job metadata are stored in a database; the Flink environment is rebuilt from the DB if missing.
Runtime configuration overrides: per‑session configuration (e.g., cluster ID) merges defaults from FLINK_CONF_DIR with DB‑stored values.
Multi‑tenant authentication: Kerberos is used across NetEase components. The solution employs Hadoop proxy‑user impersonation, logging in as a super‑user, obtaining delegation tokens, and executing as the impersonated user.
Horizontal scaling: the service is stateless, so instances can be added as load grows. A session‑affinity load balancer routes requests with the same session ID to the same instance, which keeps the TCP connections that carry select query results intact.
Job state management: a custom JobManager archive lookup retrieves the latest checkpoint or restored checkpoint for state migration, rather than relying on Lambda’s per‑job checkpoint strategy.
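The runtime configuration override in the list above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it parses only the flat key: value lines of a legacy flink-conf.yaml, the function names are hypothetical, and yarn.application.id stands in for whatever cluster ID key the platform actually stores per session.

```python
from typing import Dict


def parse_flink_conf(text: str) -> Dict[str, str]:
    """Parse flat `key: value` lines; `#` starts a comment (simplified)."""
    conf: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)  # split once so URIs keep their colons
        conf[key.strip()] = value.strip()
    return conf


def effective_conf(defaults: Dict[str, str],
                   session_overrides: Dict[str, str]) -> Dict[str, str]:
    """Return a new mapping in which DB-stored session values win."""
    merged = dict(defaults)          # copy so the shared defaults stay untouched
    merged.update(session_overrides)
    return merged


# Usage: defaults would come from $FLINK_CONF_DIR/flink-conf.yaml,
# overrides from the session row in the database.
defaults = parse_flink_conf(
    "parallelism.default: 1\n"
    "state.checkpoints.dir: hdfs://nn:8020/flink/checkpoints  # shared\n"
)
session = {"yarn.application.id": "application_0001", "parallelism.default": "4"}
merged = effective_conf(defaults, session)
```

Copying the defaults before merging matters in a multi-session service: each session gets its own effective configuration while the values loaded from FLINK_CONF_DIR remain shared and unmodified.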
Challenges and Solutions
Metadata Persistence
Since the open‑source gateway does not persist session metadata, the team wrapped it as a service and stored metadata in a relational database, using the local Flink environment as a cache.
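A minimal sketch of this database-backed design with a local cache, written in Python with sqlite3 standing in for the relational database. The class name, table layout, and the _build_env placeholder are assumptions for illustration; in the real service the cached object would be a live Flink environment rather than a dictionary.

```python
import json
import sqlite3
from typing import Dict, Optional


class SessionStore:
    """Persist session metadata in a DB; keep rebuilt environments in memory."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._cache: Dict[str, dict] = {}  # stands in for live Flink environments
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions "
            "(id TEXT PRIMARY KEY, props TEXT NOT NULL)"
        )

    def create(self, session_id: str, props: dict) -> None:
        # Write to the database first so the session survives a restart.
        self._conn.execute(
            "INSERT OR REPLACE INTO sessions (id, props) VALUES (?, ?)",
            (session_id, json.dumps(props)),
        )
        self._conn.commit()
        self._cache[session_id] = self._build_env(props)

    def get(self, session_id: str) -> Optional[dict]:
        env = self._cache.get(session_id)
        if env is None:
            # Cache miss: rebuild the environment from persisted metadata.
            row = self._conn.execute(
                "SELECT props FROM sessions WHERE id = ?", (session_id,)
            ).fetchone()
            if row is None:
                return None
            env = self._build_env(json.loads(row[0]))
            self._cache[session_id] = env
        return env

    @staticmethod
    def _build_env(props: dict) -> dict:
        # Placeholder for constructing a Flink environment from metadata.
        return {"props": props}
```

Creating a second SessionStore on the same connection simulates a restarted or newly added instance: its cache starts empty, and the first get for an existing session rebuilds the environment from the database.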
Multi‑Tenant Security
Hadoop’s UserGroupInformation holds the login user in static, process‑wide state, which made per‑session isolation difficult. The proxy‑user approach proved more feasible: it allows delegation‑token‑based authentication without sharing Kerberos TGTs.
Horizontal Expansion
Stateless design enables scaling, but session affinity is required to keep select result streams attached to the correct instance.
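The article does not say how the load balancer implements session affinity. One common technique is rendezvous (highest random weight) hashing on the session ID, sketched below in Python. The function name and the instance addresses are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_instance(session_id: str, instances: Sequence[str]) -> str:
    """Route a session to one instance using rendezvous hashing."""
    if not instances:
        raise ValueError("no gateway instances available")

    def score(instance: str) -> int:
        # Hash the (instance, session) pair; the highest score wins.
        digest = hashlib.sha256(f"{instance}|{session_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(instances, key=score)


# Usage: every request carrying the same session ID lands on the same instance.
gateways = ["gw-a:8083", "gw-b:8083", "gw-c:8083"]
target = pick_instance("session-42", gateways)
```

With this scheme the choice depends only on the session ID and the set of instances, not on their order, and removing an instance moves only the sessions that were routed to it. Sessions on the removed instance still lose their open result streams, which is where the persisted metadata comes in.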
Job State Management
The platform implements a JobManager‑archive‑based checkpoint lookup to restore job state, handling both completed and restored checkpoints.
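A minimal Python sketch of such a lookup follows. It assumes the archive is a JSON document with an "archive" array of path and json entries, and that the entry for /jobs/<job-id>/checkpoints mirrors the checkpoint statistics returned by Flink's REST API, with latest.completed and latest.restored each carrying an external_path. The function name and the preference order (completed first, restored as fallback) are assumptions for illustration.

```python
import json
from typing import Optional


def latest_restore_path(archive_text: str, job_id: str) -> Optional[str]:
    """Return the checkpoint path to resume from, or None if there is none."""
    archive = json.loads(archive_text)
    wanted = f"/jobs/{job_id}/checkpoints"
    for entry in archive.get("archive", []):
        if entry.get("path") != wanted:
            continue
        body = entry["json"]
        # The response body may be stored as a JSON-encoded string.
        stats = json.loads(body) if isinstance(body, str) else body
        latest = stats.get("latest") or {}
        # Prefer the newest completed checkpoint; if the job finished no
        # checkpoint after starting, fall back to the one it was restored from.
        for kind in ("completed", "restored"):
            candidate = latest.get(kind)
            if candidate and candidate.get("external_path"):
                return candidate["external_path"]
    return None
```

The fallback to the restored checkpoint covers a job that was resumed from state but stopped again before completing a new checkpoint; without it, that job would appear to have no state at all.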
Future Outlook
Planned improvements include state migration support for altered SQL, finer‑grained resource management per job, and contributing generic enhancements back to the Flink community (e.g., FLIP‑91).
References
FLIP‑74: Flink JobClient API
Flink SQL 1.11 on Zeppelin platform practice
Flink SQL Gateway
FLIP‑91: Support SQL Client Gateway
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.