
Custom Flink Scheduler Enhancements: Resource Balancing, Task Migration, and TmRestart Strategy

The article details Dewu’s custom Flink scheduler, DwScheduler, which adds JSON‑based resource specifications, per‑TaskManager slot sharing for balanced CPU use, hot TaskManager migration callbacks, and a new TmRestart strategy for rapid pod‑process recovery, offering practical techniques to enhance real‑time stream processing stability and performance.

DeWu Technology

As big‑data workloads grow, real‑time processing has become increasingly critical, and Apache Flink is one of the most widely adopted stream‑processing frameworks for it. This article shares Dewu's work on the Flink kernel, focusing on the optimizations and customizations that improve efficiency and stability.

Reader Benefits

Deep understanding of Flink's core architecture and runtime mechanisms.

Practical optimization experience, including parameter tuning and kernel customization.

Solutions to common Flink issues and techniques for handling complex scenarios.

Real‑time processing case studies demonstrating high‑performance data handling.

Best‑practice recommendations to avoid pitfalls and boost development efficiency.

Custom Features

1. Self‑Developed Scheduler (DwScheduler)

DwScheduler integrates the advantages of community schedulers and adds features tailored to Dewu's production environment. It establishes a direct link between JobGraph and resources, enabling JSON‑based resource specifications and dynamic scaling.

SchedulerNG (interface)
   |
   +-- SchedulerBase (implements SchedulerNG)
         |
         +-- DefaultScheduler (extends SchedulerBase) // default resource scheduler
               |
               +-- AdaptiveBatchScheduler (extends DefaultScheduler) // adaptive batch scheduler
                     |
                     +-- SpeculativeScheduler (extends AdaptiveBatchScheduler) // speculative execution
               |
               +-- DwScheduler (extends DefaultScheduler) // custom scheduler
   |
   +-- AdaptiveScheduler (implements SchedulerNG) // adaptive scheduler

2. Simplified Resource Scheduling

Jobs (SQL/DataStream) are compiled into a JSON representation of the stream graph and resource profile. Users can edit this JSON directly or use a UI to configure operator parallelism, SlotSharingGroup, and resource parameters (CPU, memory, off‑heap). The system automatically calculates off‑heap settings to reduce OOM risk.
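The article does not show the JSON schema or the off‑heap formula, so the following is only a minimal Java sketch of the idea. The `OperatorSpec` fields, the class names, and the 25%/128 MB off‑heap rule are all assumptions made for illustration. The per‑group arithmetic relies on standard Flink slot sharing: a slot in a group can hold one subtask of each operator in that group, so a group needs as many slots as its highest operator parallelism.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of one operator entry in a job's resource spec. */
final class OperatorSpec {
    final String name;
    final int parallelism;
    final String slotSharingGroup;
    final double cpuCores;
    final int heapMb;

    OperatorSpec(String name, int parallelism, String slotSharingGroup,
                 double cpuCores, int heapMb) {
        this.name = name;
        this.parallelism = parallelism;
        this.slotSharingGroup = slotSharingGroup;
        this.cpuCores = cpuCores;
        this.heapMb = heapMb;
    }
}

final class ResourceSpecSketch {
    /** Assumed rule (not Dewu's): off-heap is 25% of heap, but at least 128 MB. */
    static int deriveOffHeapMb(int heapMb) {
        return Math.max(128, heapMb / 4);
    }

    /** Slots needed per group: the maximum parallelism among its operators. */
    static Map<String, Integer> slotsPerGroup(List<OperatorSpec> ops) {
        Map<String, Integer> slots = new LinkedHashMap<>();
        for (OperatorSpec op : ops) {
            slots.merge(op.slotSharingGroup, op.parallelism, Math::max);
        }
        return slots;
    }

    /** Worst-case heap per slot: one subtask of every operator in the group. */
    static Map<String, Integer> heapMbPerSlot(List<OperatorSpec> ops) {
        Map<String, Integer> heap = new LinkedHashMap<>();
        for (OperatorSpec op : ops) {
            heap.merge(op.slotSharingGroup, op.heapMb, Integer::sum);
        }
        return heap;
    }
}
```

Deriving the off‑heap value from the heap setting, rather than asking users for it, is one plausible way to get the OOM protection described above; the real calculation may weigh other factors.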

3. Balanced Task Scheduling

DwScheduler introduces a DwSlotSharingStrategy that balances tasks per TaskManager rather than per slot, which spreads CPU load more evenly across TaskManagers. The article reports that benchmarks show a significant reduction in CPU imbalance and improved throughput.
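To make the difference between per‑slot and per‑TaskManager balancing concrete, here is a small Java sketch of one possible approach: a greedy placement that puts the heaviest task first onto the TaskManager with the lowest accumulated CPU estimate. This is not the DwSlotSharingStrategy implementation, which the article does not show; the class name, the per‑task CPU estimates, and the greedy rule are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BalancedPlacementSketch {
    /**
     * Greedy placement: heaviest task first, each onto the TaskManager with the
     * lowest accumulated CPU estimate that still has a free slot.
     * Returns the TaskManager index chosen for each task, in input order.
     */
    static int[] assign(double[] taskCpu, int tmCount, int slotsPerTm) {
        if ((long) tmCount * slotsPerTm < taskCpu.length) {
            throw new IllegalArgumentException("not enough slots for all tasks");
        }
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < taskCpu.length; i++) {
            order.add(i);
        }
        // Sort task indices by estimated CPU cost, descending.
        order.sort(Comparator.comparingDouble((Integer i) -> taskCpu[i]).reversed());

        double[] load = new double[tmCount]; // accumulated CPU per TaskManager
        int[] used = new int[tmCount];       // occupied slots per TaskManager
        int[] placement = new int[taskCpu.length];
        for (int task : order) {
            int best = -1;
            for (int tm = 0; tm < tmCount; tm++) {
                if (used[tm] < slotsPerTm && (best < 0 || load[tm] < load[best])) {
                    best = tm;
                }
            }
            load[best] += taskCpu[task];
            used[best]++;
            placement[task] = best;
        }
        return placement;
    }

    /** Sums the CPU estimates of the tasks placed on each TaskManager. */
    static double[] loadPerTm(double[] taskCpu, int[] placement, int tmCount) {
        double[] load = new double[tmCount];
        for (int i = 0; i < taskCpu.length; i++) {
            load[placement[i]] += taskCpu[i];
        }
        return load;
    }
}
```

For example, tasks with estimates `{4.0, 3.0, 2.0, 1.0}` on two TaskManagers with two slots each end up with a load of 5.0 on both. Filling slots in input order would instead put 7.0 on the first TaskManager and 3.0 on the second.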

4. TaskManager Hot Migration

The scheduler exposes six callback methods, one before and one after each of three phases: resource request, restart, and deploy. Services can use them to trigger hot migration of TaskManagers without affecting the scheduling flow, so that TaskManagers on hotspot or faulty nodes can be migrated quickly (within 1‑5 seconds).

default void preRequestResource() {}

default void postRequestResource(Throwable throwable) {}

default void preRestart() {}

default void postRestart(Throwable throwable) {}

default void preDeploy() {}

default void postDeploy(Throwable throwable) {}
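The following Java sketch shows one way such hooks could be wired around a scheduling phase. The interface name `SchedulerHooks`, the `HookedDeploySketch` driver, and the convention that the `post*` callback receives `null` on success are assumptions; only the six method signatures come from the listing above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical interface name; the six signatures are the ones listed above. */
interface SchedulerHooks {
    default void preRequestResource() {}
    default void postRequestResource(Throwable throwable) {}
    default void preRestart() {}
    default void postRestart(Throwable throwable) {}
    default void preDeploy() {}
    default void postDeploy(Throwable throwable) {}
}

final class HookedDeploySketch {
    /**
     * Runs a deploy step between preDeploy and postDeploy.
     * Assumption: postDeploy receives null on success and the failure otherwise.
     */
    static void deploy(SchedulerHooks hooks, Runnable deployStep) {
        hooks.preDeploy();
        Throwable failure = null;
        try {
            deployStep.run();
        } catch (RuntimeException e) {
            failure = e;
            throw e; // the scheduling flow still sees the original failure
        } finally {
            hooks.postDeploy(failure);
        }
    }
}

/** Example listener that only records events; a real service might request a migration here. */
final class RecordingHooks implements SchedulerHooks {
    final List<String> events = new ArrayList<>();

    @Override
    public void preDeploy() {
        events.add("preDeploy");
    }

    @Override
    public void postDeploy(Throwable throwable) {
        events.add(throwable == null ? "postDeploy:ok" : "postDeploy:failed");
    }
}
```

Because every method has an empty default body, a listener overrides only the phases it cares about, and a scheduler with no listener behaves exactly as before.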

5. TmRestart Strategy

Beyond the community's FullRestart and RegionRestart strategies, Dewu adds a TmRestart strategy. It restarts the TaskManager process inside the pod (the pod's main process is a resident shell, so the pod itself stays up) when a cancel operation exceeds a timeout or when a resource leak is detected. It also allows JVM parameters to be adjusted dynamically based on failure analysis.
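The trigger conditions described above can be sketched as a small decision function. This is an illustration only: the `RestartScope` enum, the method signature, and the fallback to a region restart when neither condition holds are assumptions, not the article's code.

```java
import java.time.Duration;

/** Hypothetical restart scopes used only for this sketch. */
enum RestartScope { REGION, TASK_MANAGER_PROCESS }

final class TmRestartDecisionSketch {
    /**
     * Escalates to restarting the TaskManager process when cancellation has run
     * longer than the timeout or a resource leak was detected; otherwise falls
     * back to a region restart (the fallback is an assumption).
     */
    static RestartScope choose(Duration cancelElapsed, Duration cancelTimeout,
                               boolean resourceLeakDetected) {
        boolean cancelTimedOut = cancelElapsed.compareTo(cancelTimeout) > 0;
        return (cancelTimedOut || resourceLeakDetected)
                ? RestartScope.TASK_MANAGER_PROCESS
                : RestartScope.REGION;
    }
}
```

Restarting only the process inside a still‑running pod avoids requesting a new pod, which is presumably where the rapid recovery comes from.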

Overall, the article presents a comprehensive set of enhancements to Flink's scheduling, resource management, and restart mechanisms, offering practical guidance for building robust real‑time data pipelines.
