
Flink Real-time Task Resource Optimization Practice at Youzan

At Youzan, Flink real-time tasks running on Kubernetes are tuned through two daily checks: memory analysis based on GC logs, and monitoring of Kafka throughput against each task's processing capability. These checks yield recommended heap sizes and parallelism adjustments that remove over-provisioned CPU and memory, drive automated alerts, and pave the way for fully automated resource tuning.

Youzan Coder

As more Flink real-time tasks run on Kubernetes clusters at Youzan, efficient resource allocation has become critical. Deploying Flink on Kubernetes improves elastic scaling during peak periods and reduces operational costs, but users with limited tuning experience often over-configure resources, which wastes a significant amount of computing capacity.

This article explores Flink task resource optimization from two perspectives: memory analysis and message processing capability.

1. Flink Computing Resource Types

Flink tasks require five types of resources: memory, local disk storage, external storage (HDFS, S3, HBase, MySQL, Redis), CPU, and network interface (NIC) bandwidth. In practice, memory and CPU are the resources that matter most; the others rarely become bottlenecks.

2. Resource Optimization Approach

The optimization strategy has two parts: sizing heap memory from GC logs, and evaluating message processing capability so that CPU resources, allocated through parallelism, match the actual load.

Memory Analysis via GC Logs

The analysis uses GCViewer to examine GC logs and extract key metrics: total heap size, young generation size, old generation size, and the old generation space still occupied after a Full GC. According to Java performance tuning guidelines, if the old generation occupancy after a Full GC (the live data size) is M, the recommended heap size is 3-4×M, with the young generation at 1-1.5×M and the old generation at 2-3×M.
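The sizing rule above is simple arithmetic, so it can be illustrated with a minimal sketch. The function and class names here (`recommend_heap`, `HeapRecommendation`) are hypothetical and not part of Youzan's platform; the only input assumed is M in megabytes, already extracted from the GC log.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HeapRecommendation:
    """Recommended (low, high) ranges in MB. Hypothetical helper type."""
    heap_mb: Tuple[float, float]
    young_mb: Tuple[float, float]
    old_mb: Tuple[float, float]


def recommend_heap(live_old_gen_mb: float) -> HeapRecommendation:
    """Apply the 3-4x / 1-1.5x / 2-3x rule to M.

    M is the old generation occupancy after a Full GC, in MB.
    """
    if live_old_gen_mb <= 0:
        raise ValueError("live_old_gen_mb must be positive")
    m = live_old_gen_mb
    return HeapRecommendation(
        heap_mb=(3 * m, 4 * m),      # total heap: 3-4 x M
        young_mb=(1 * m, 1.5 * m),   # young generation: 1-1.5 x M
        old_mb=(2 * m, 3 * m),       # old generation: 2-3 x M
    )


if __name__ == "__main__":
    # Example value only: 500 MB still occupied after a Full GC.
    print(recommend_heap(500))
```

With M = 500 MB, the rule suggests a heap of 1500 to 2000 MB. A task configured with a much larger heap than that upper bound would be a candidate for downsizing.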

Message Processing Capability Analysis

This analysis compares the input rate of the Kafka topic with the processing capability of each operator/task. The slowest operator determines the overall processing capacity of the job. Custom metrics record the single-record processing time of each task, and the values are retrieved through the Flink REST API. Parallelism is then increased when a task cannot keep up with the input rate, decreased when its capacity clearly exceeds the input rate, and left unchanged otherwise.
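The comparison can be sketched as follows. This is an illustration under stated assumptions, not the platform's actual logic: every operator is assumed to see the full input rate (no filtering or fan-out), each subtask is assumed to process one record at a time, and the `headroom` safety factor and all function names are hypothetical. Per-record processing times are taken as already fetched; the REST call itself is not shown.

```python
import math
from typing import Dict, Tuple

# operator name -> (current parallelism, seconds spent per record)
Operators = Dict[str, Tuple[int, float]]


def capacity(parallelism: int, seconds_per_record: float) -> float:
    """Records per second an operator can handle at this parallelism."""
    return parallelism / seconds_per_record


def bottleneck(operators: Operators) -> str:
    """The operator with the lowest capacity bounds the whole job."""
    return min(operators, key=lambda name: capacity(*operators[name]))


def advise(input_rate: float, operators: Operators,
           headroom: float = 1.2) -> Dict[str, Tuple[str, int]]:
    """Return (action, suggested parallelism) per operator.

    headroom is an assumed safety margin above the observed input rate.
    """
    advice = {}
    for name, (parallelism, seconds_per_record) in operators.items():
        needed = max(1, math.ceil(input_rate * seconds_per_record * headroom))
        if needed > parallelism:
            action = "increase"
        elif needed < parallelism:
            action = "decrease"
        else:
            action = "maintain"
        advice[name] = (action, needed)
    return advice


if __name__ == "__main__":
    # Example values only: 1000 records/s arriving from Kafka.
    ops = {"parse": (4, 0.002), "sink": (2, 0.004)}
    print("bottleneck:", bottleneck(ops))
    print(advise(1000, ops))
```

In this example the `sink` operator can handle only 500 records per second at parallelism 2, so it is the bottleneck and its parallelism should go up, while `parse` has spare capacity and could be scaled down.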

Practical Implementation

Youzan's real-time platform automatically scans running Flink tasks daily, calculates recommended heap memory based on GC logs, and alerts administrators when resource waste is detected. The platform also monitors message processing capabilities and suggests parallelism adjustments after communicating with business teams.
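A daily scan of this kind could flag waste roughly as sketched below. The record type, the field names, and the choice of 4×M (the upper end of the recommended heap range) as the alert threshold are all assumptions made for illustration; the article does not specify the platform's actual threshold or data model.

```python
from typing import Iterable, List, NamedTuple, Tuple


class JobHeapStats(NamedTuple):
    """Hypothetical per-job record produced by GC log analysis."""
    name: str
    configured_heap_mb: float
    live_old_gen_mb: float  # old generation occupancy after a Full GC (M)


def find_overprovisioned(jobs: Iterable[JobHeapStats],
                         upper_factor: float = 4.0) -> List[Tuple[str, float]]:
    """Return (job name, excess MB) for jobs whose configured heap
    exceeds upper_factor x M. The threshold is an assumption."""
    alerts = []
    for job in jobs:
        recommended_max = upper_factor * job.live_old_gen_mb
        if job.configured_heap_mb > recommended_max:
            alerts.append((job.name, job.configured_heap_mb - recommended_max))
    return alerts


if __name__ == "__main__":
    # Example values only.
    scanned = [
        JobHeapStats("orders", 8192, 512),
        JobHeapStats("clicks", 2048, 600),
    ]
    for name, excess in find_overprovisioned(scanned):
        print(f"{name}: heap exceeds recommendation by {excess:.0f} MB")
```

Here only `orders` is flagged: its 8192 MB heap is 6144 MB above the 2048 MB upper bound, while `clicks` sits within its recommended range.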

Future Plans

Future work includes fully automating resource optimization by analyzing historical resource usage patterns and collaborating with the metadata platform to explore additional optimization opportunities.

