Flink Real-time Task Resource Optimization Practice at Youzan

At Youzan, Flink real-time tasks running on Kubernetes are optimized through daily GC-log memory analysis and Kafka throughput monitoring. These analyses produce recommended heap sizes and parallelism adjustments that reduce over-provisioned CPU and memory, drive automated alerts, and pave the way for fully automated resource tuning.

Youzan Coder

Jan 13, 2021

As more Flink real-time tasks run on Kubernetes clusters at Youzan, efficient resource allocation has become critical. Deploying Flink on Kubernetes improves elastic scaling during peak periods and reduces operational costs, but users with limited tuning experience often over-configure resources, which wastes a significant amount of computing capacity.

This article explores Flink task resource optimization from two perspectives: memory analysis and message processing capability.

1. Flink Computing Resource Types

Flink tasks consume five types of resources: memory, local disk storage, external storage (HDFS, S3, HBase, MySQL, Redis), CPU, and network bandwidth. In practice, memory and CPU are the resources tasks mainly consume, while the others rarely become bottlenecks.

2. Resource Optimization Approach

The optimization strategy has two parts: sizing heap memory from GC logs, and evaluating message processing capability against input traffic so that CPU resources are allocated reasonably.

Memory Analysis via GC Logs

The analysis uses GCViewer to examine GC logs and extract key metrics: total heap size, young generation and old generation allocation, and the space still occupied in the old generation after a Full GC (the live data size). According to Java performance tuning guidelines, if the old generation occupancy after a Full GC is M, the recommended heap size is 3-4×M, the young generation 1-1.5×M, and the old generation 2-3×M.
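The sizing rule above is simple arithmetic, so it can be sketched in a few lines. The following is a minimal illustration, not the platform's actual code: the names `HeapRecommendation`, `recommend_heap`, and `is_over_provisioned` are hypothetical, and flagging a task as over-provisioned when its configured heap exceeds the upper bound of 4×M is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeapRecommendation:
    """Recommended (min, max) sizes in MB, derived from live data size M."""
    heap_mb: tuple[float, float]
    young_mb: tuple[float, float]
    old_mb: tuple[float, float]


def recommend_heap(live_old_gen_mb: float) -> HeapRecommendation:
    """Apply the 3-4x / 1-1.5x / 2-3x guideline to M.

    live_old_gen_mb is M: old generation occupancy after a Full GC,
    as read from the GC log.
    """
    if live_old_gen_mb <= 0:
        raise ValueError("live_old_gen_mb must be positive")
    m = live_old_gen_mb
    return HeapRecommendation(
        heap_mb=(3 * m, 4 * m),
        young_mb=(1 * m, 1.5 * m),
        old_mb=(2 * m, 3 * m),
    )


def is_over_provisioned(configured_heap_mb: float, live_old_gen_mb: float) -> bool:
    """Assumption: a heap above the upper bound (4 x M) counts as waste."""
    return configured_heap_mb > recommend_heap(live_old_gen_mb).heap_mb[1]
```

For example, a task whose old generation holds 200 MB after a Full GC would be recommended a heap of 600 to 800 MB, so a configured heap of 4096 MB would be flagged under this assumption.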

Message Processing Capability Analysis

This involves comparing the Kafka topic input rate with the processing capability of each operator/task. The slowest operator determines the overall processing capacity of the job. The analysis uses custom metrics that measure the single-record processing time of each task, retrieved through the Flink REST API. Based on how the resulting processing rate compares with the input rate, the logic decides whether to increase, decrease, or keep the current parallelism.
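The decision logic described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than Youzan's implementation: the per-record times are passed in directly instead of being fetched from the Flink REST API, the names `OperatorStats`, `job_capacity`, `suggest_parallelism_change`, and `required_parallelism` are hypothetical, and the `headroom` factor that triggers a decrease is an assumed threshold that the source does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorStats:
    name: str
    parallelism: int
    per_record_ms: float  # average time one subtask spends on one record


def operator_capacity(op: OperatorStats) -> float:
    """Records per second the operator can handle across all subtasks."""
    return op.parallelism * 1000.0 / op.per_record_ms


def job_capacity(ops: list[OperatorStats]) -> float:
    """The slowest operator bounds the capacity of the whole job."""
    return min(operator_capacity(op) for op in ops)


def suggest_parallelism_change(
    ops: list[OperatorStats], input_rate: float, headroom: float = 2.0
) -> str:
    """Return 'increase', 'decrease', or 'keep'.

    input_rate is the Kafka topic input rate in records per second.
    headroom is an assumed safety factor: capacity is only considered
    excessive when it exceeds input_rate * headroom.
    """
    capacity = job_capacity(ops)
    if capacity < input_rate:
        return "increase"
    if capacity > input_rate * headroom:
        return "decrease"
    return "keep"


def required_parallelism(op: OperatorStats, input_rate: float) -> int:
    """Smallest parallelism at which the operator keeps up with input_rate."""
    return max(1, math.ceil(input_rate * op.per_record_ms / 1000.0))
```

With a source operator at 0.5 ms per record and an enrichment operator at 2 ms per record, both at parallelism 4, the job capacity is bounded by the enrichment operator at 2000 records per second, so an input rate of 3000 records per second would yield "increase".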

Practical Implementation

Youzan's real-time platform automatically scans running Flink tasks daily, calculates recommended heap memory based on GC logs, and alerts administrators when resource waste is detected. The platform also monitors message processing capabilities and suggests parallelism adjustments after communicating with business teams.

Future Plans

Future work includes fully automating resource optimization by analyzing historical resource usage patterns and collaborating with the metadata platform to explore additional optimization opportunities.
