Stability Optimization Practices for Flink Jobs at Tencent
This article presents Tencent's practical experience in improving Flink job stability. It covers the Oceanus platform, the stability challenges Flink jobs face, and concrete optimization techniques for reducing failures, minimizing their impact, accelerating recovery, and detecting issues proactively, followed by a summary and outlook.
Speaker: Qiu Congxian, Senior Development Engineer, Tencent
Organizer: DataFunTalk
Overview: The session focuses on practical stability optimization for Flink jobs, including the following topics:
Flink usage at Tencent
Introduction to Flink stability
Stability optimization practices
Summary & outlook
01 Flink Usage at Tencent
Flink is deployed on Tencent's internal one‑stop real‑time computing platform, Oceanus. Oceanus supports Jar, SQL, and canvas jobs; runs on Yarn and Kubernetes; stores state and data in HDFS; and reads input from message queues (MQ) while writing results back to MQ and other downstream services.
Typical application scenarios of Flink at Tencent are shown in the following diagrams.
02 Flink Stability Overview
The Flink runtime consists of a Master (scheduling), a resource pool (Yarn/Kubernetes), HA managed by Zookeeper, and Tasks that execute user logic. During checkpointing, task state is backed up to HDFS.
Master handles job scheduling
Resource pool: Yarn and K8s
HA: Zookeeper
Task runs user logic
Checkpoint copies state to HDFS
Stability can be affected by third‑party systems (Yarn, HDFS, Zookeeper) and Flink itself (control chain, data chain, back‑pressure, bugs, user logic errors).
Third‑party factors: Yarn container count & resources, HDFS storage & access, Zookeeper connections
Flink factors: Master‑Worker communication, checkpoint flow, data hotspots, back‑pressure, bugs, user errors
03 Stability Optimization Practices
Optimization is divided into three aspects:
Reduce failures – e.g., improve Zookeeper HA protocol, merge small checkpoint files to reduce HDFS RPCs.
Lower impact – limit the number of affected tasks (single‑task restart, Master failover without full job restart) and reduce downtime (accelerate job start).
Fast detection & recovery – automatic diagnosis system, proactive monitoring.
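The small-file merging idea above can be sketched in a few lines. This is an illustrative model, not Tencent's or Flink's actual implementation: many small checkpoint blobs are appended into one container file with an in-memory offset index, so HDFS sees one create/close RPC pair instead of one pair per blob.

```python
# Sketch (assumption, not the real implementation): merge many small
# checkpoint blobs into one physical file to cut HDFS RPC traffic.

class MergedCheckpointFile:
    """Append many small state blobs into a single physical file."""

    def __init__(self):
        self.data = bytearray()   # stands in for one HDFS file
        self.index = {}           # blob name -> (offset, length)
        self.rpc_calls = 2        # one create + one close, amortized

    def write_blob(self, name: str, blob: bytes) -> None:
        self.index[name] = (len(self.data), len(blob))
        self.data.extend(blob)    # no extra RPC per blob

    def read_blob(self, name: str) -> bytes:
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])


def rpc_calls_unmerged(num_blobs: int) -> int:
    # Without merging, each small file costs its own create + close RPC.
    return 2 * num_blobs


merged = MergedCheckpointFile()
for i in range(1000):
    merged.write_blob(f"task-{i}", b"state-bytes")

print(rpc_calls_unmerged(1000), "->", merged.rpc_calls)  # 2000 -> 2
```

The key design point is that the reduction is purely in metadata operations: the total bytes written are unchanged, but the NameNode handles two RPCs instead of thousands.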
(1) Reduce failures – Optimizing Zookeeper HA
Before optimization, every TaskManager kept a long‑lived connection to Zookeeper, causing high connection counts. After optimization, only the Master maintains a Zookeeper connection; TaskManagers disconnect, dramatically reducing Zookeeper load.
The master‑failover process is illustrated below.
After the new Master registers with Zookeeper, TaskManagers detect the heartbeat timeout, briefly reconnect to Zookeeper to obtain the new Master address, and then drop their Zookeeper connections again.
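A toy model (assumed formula, not Oceanus code) makes the connection-count reduction concrete: before the optimization every TaskManager holds a long-lived Zookeeper session, afterwards only each job's Master does.

```python
# Toy model of Zookeeper session counts before/after the HA change.
# The cluster shape (jobs, TaskManagers per job) is an assumption.

def zk_connections_before(num_jobs: int, tms_per_job: int) -> int:
    # Every TaskManager plus the Master of each job holds a session.
    return num_jobs * (tms_per_job + 1)


def zk_connections_after(num_jobs: int) -> int:
    # Only the Master keeps a session; TaskManagers learn the Master
    # address during failover and then drop their connection.
    return num_jobs


print(zk_connections_before(100, 50))  # 5100
print(zk_connections_after(100))       # 100
```

For a cluster with many large jobs, the session count shrinks from O(jobs × TaskManagers) to O(jobs), which is what relieves the Zookeeper load described above.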
(2) Lower impact – Single‑task restart
When a task fails, only that task is restarted while other tasks continue running. This avoids global failover and reduces data loss. Diagrams show the workflow and latency differences between global and single‑task restarts.
Experimental results indicate that single‑task restart reduces the perceived data interruption from 138 seconds to near zero, though it is not yet fully loss‑less: a small amount of in‑flight data may be lost during the restart.
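Tencent's single‑task restart is a custom refinement, but open‑source Flink ships a coarser built‑in analogue: region failover, which restarts only the failed pipelined region rather than the whole job. It can be enabled in `flink-conf.yaml` (shown here as a reference point, not as Tencent's implementation):

```yaml
# flink-conf.yaml — restart only the failed pipelined region instead of
# the entire job graph. This is open-source Flink's built-in strategy;
# single-task restart as described in the talk is finer-grained still.
jobmanager.execution.failover-strategy: region
```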
(3) Fast recovery – Startup acceleration
Three bottlenecks are identified: Master RPC handling, downloading of container files, and container provisioning. Optimizations include eliminating unnecessary RPCs, merging small dependency files, and keeping extra standby containers pre‑provisioned. Job recovery time drops from roughly 200 seconds to 48 seconds.
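A back-of-the-envelope cost model shows why attacking all three bottlenecks matters. All numbers below are illustrative placeholders, not the measured figures from the talk:

```python
# Illustrative startup-time model: the three bottlenecks sum, so each
# optimization removes the bulk of one term. Numbers are assumptions.

def startup_seconds(rpc_count: int, rpc_ms: float,
                    file_count: int, fetch_ms: float,
                    container_wait_s: float) -> float:
    """Master RPC handling + dependency fetching + container wait."""
    return (rpc_count * rpc_ms / 1000
            + file_count * fetch_ms / 1000
            + container_wait_s)


# Before: many redundant RPCs, hundreds of small jars, cold containers.
before = startup_seconds(rpc_count=20000, rpc_ms=5,
                         file_count=500, fetch_ms=100,
                         container_wait_s=30)

# After: RPCs deduplicated, jars merged into a few bundles,
# standby containers already provisioned.
after = startup_seconds(rpc_count=2000, rpc_ms=5,
                        file_count=5, fetch_ms=100,
                        container_wait_s=0)

print(f"{before:.1f}s -> {after:.1f}s")
```

The point of the model is that no single fix suffices: with three additive terms of similar magnitude, only reducing all three yields the several-fold speedup reported.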
(4) Proactive issue detection
Traditional reactive alerts (heartbeat timeout, checkpoint expiry) require manual log and metric inspection. Tencent built an integrated diagnostics system combining logs, metrics, and traces to automatically pinpoint root causes (e.g., OOM killer, slow synchronous snapshot) and suggest remedies.
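The core of such a diagnoser can be sketched as a rule table that maps log patterns to a root cause and a suggested remedy. The rules and remedies below are hypothetical examples, not Tencent's actual rule set:

```python
# Hypothetical rule-based diagnoser sketch: patterns over log lines map
# to (root cause, suggested fix). Rules here are illustrative only.

RULES = [
    ("Killed process",
     "Container killed by the OS OOM killer",
     "Increase container memory or lower the managed-memory fraction"),
    ("Checkpoint expired",
     "Checkpoint timed out, likely due to back-pressure",
     "Inspect back-pressured operators and data hotspots"),
    ("sync part of the snapshot",
     "Slow synchronous snapshot phase",
     "Consider an incremental/asynchronous state backend configuration"),
]


def diagnose(log_lines):
    """Return (root_cause, suggestion) pairs for matching log lines."""
    findings = []
    for line in log_lines:
        for pattern, cause, fix in RULES:
            if pattern in line:
                findings.append((cause, fix))
    return findings


logs = ["2024-05-01 12:00:01 WARN Killed process 4321 (java)"]
for cause, fix in diagnose(logs):
    print(cause, "->", fix)
```

A production system would additionally correlate metrics and traces with the log evidence, but the log-pattern table already captures the shift from "alert, then read logs by hand" to "alert with a root cause attached".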
04 Summary & Outlook
We addressed stability from three angles: reducing failures, lowering impact, and fast detection & recovery. Future work includes loss‑less single‑task restart, rapid state recovery for large states, and further automation of diagnosis and remediation.
Thank you for attending.