How FlinkSQL Optimizations Cut CPU Usage by Up to 60% in Streaming Jobs
This article details the FlinkSQL performance enhancements implemented by the streaming team: view reuse, redundant streaming-shuffle removal, a streaming multiple-input operator, long sliding-window optimization, and a native JSON format. Together these changes deliver up to 60% CPU savings on individual workloads and save over 100,000 core-hours.
Background and Motivation
Amid a push for cost reduction and efficiency, the streaming team optimized FlinkSQL to meet higher performance demands. Flink SQL is widely used internally, with over 30,000 streaming tasks consuming millions of cores.
1. Engine Optimizations
View Reuse
In streaming SQL, common logic is placed in logical views, which are not materialized. Multiple downstream branches cause the view’s computation to be duplicated, leading to unnecessary resource consumption. The root cause is that Calcite converts each view query into a separate RelNode tree, preventing reuse.
Solution: Modify Calcite's SqlToRel conversion to return a LogicalTableScan that references the catalog view, and store the CatalogView so all downstream operators share the same RelNode tree. This eliminates duplicate computation and yields a 20% CPU gain.
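The effect of sharing a single RelNode tree can be sketched as follows. This is a conceptual illustration only; the class and function names are made up for the example and are not Calcite or Flink APIs:

```python
# Conceptual sketch: why sharing one plan node for a view avoids
# duplicate computation. Names here are illustrative, not real APIs.

class ScanNode:
    """Stands in for a LogicalTableScan over a catalog view."""
    def __init__(self, compute):
        self.compute = compute
        self._cache = None

    def evaluate(self):
        # A shared node is computed once; its result is reused downstream.
        if self._cache is None:
            self._cache = self.compute()
        return self._cache

def build_plan(view_source, reuse_view):
    # Without reuse, each downstream branch gets its own copy of the
    # view subtree; with reuse, both branches point at the same node.
    if reuse_view:
        shared = ScanNode(view_source)
        return [shared, shared]
    return [ScanNode(view_source), ScanNode(view_source)]

expensive_calls = []
def view_source():
    expensive_calls.append(1)          # track duplicated work
    return [("k", 1), ("k", 2)]

for branch in build_plan(view_source, reuse_view=False):
    branch.evaluate()
print(len(expensive_calls))            # 2: view computed per branch

expensive_calls.clear()
for branch in build_plan(view_source, reuse_view=True):
    branch.evaluate()
print(len(expensive_calls))            # 1: shared node computed once
```

The same principle applies at plan-construction time: once both downstream branches reference one RelNode tree, the runtime executes the view logic once and fans out its output.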
Remove Redundant Streaming Shuffle
Streaming shuffle (exchange) incurs serialization and network overhead. When two operators share the same hash key, a second shuffle is unnecessary.
Approach: Extend batch‑shuffle optimization rules to streaming, checking node distribution requirements and inserting exchanges only when needed. This reduces extra hash shuffles, achieving a 24% CPU improvement.
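The core of such a rule is a distribution check: insert an exchange only when the input's existing partitioning does not already satisfy the downstream requirement. A minimal sketch, with illustrative names rather than Flink's actual planner classes:

```python
# Hedged sketch of the rule's core check (illustrative names): only
# insert a hash exchange when the input's existing distribution does
# not already satisfy the downstream node's requirement.

from dataclasses import dataclass

@dataclass(frozen=True)
class HashDistribution:
    keys: tuple  # fields the data is hash-partitioned on

def satisfies(actual, required):
    # A hash distribution satisfies the requirement when it partitions
    # on the required keys (simplified; real rules also handle
    # singleton, broadcast, and "any" distributions).
    return actual is not None and actual.keys == required.keys

def maybe_insert_exchange(plan, actual, required):
    if satisfies(actual, required):
        return plan                     # reuse upstream partitioning
    return ["HashExchange" + str(required.keys)] + plan

upstream = ["Join(a.id = b.id)"]
dist = HashDistribution(("id",))

# Downstream aggregation also keyed on id: no second shuffle needed.
print(maybe_insert_exchange(upstream, dist, HashDistribution(("id",))))

# Keyed on a different field: an exchange must be inserted.
print(maybe_insert_exchange(upstream, dist, HashDistribution(("user",))))
```

In the first case the join's hash partitioning on `id` is reused, so the serialization and network cost of a second shuffle disappears entirely.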
Streaming Multiple‑Input Operator
To avoid multiple shuffles in join‑agg‑union patterns, the team introduced a MultipleInput operator that merges several inputs into a single operator, handling state, timers, watermarks, and checkpoints collectively. This optimization provides a 10% CPU gain and resolves memory‑pressure issues.
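The fusion idea can be sketched as a single operator that dispatches records by input index and tracks a combined watermark, the way a chain of separate operators would. This is a simplified illustration, not Flink's actual MultipleInput interface:

```python
# Sketch of a multiple-input operator (illustrative, not Flink's API):
# operators that would each sit behind their own exchange are fused
# into one operator that dispatches records by input index, so
# watermarks (and, in the real system, state and checkpoints) are
# handled in one place.

class MultipleInputOperator:
    def __init__(self, process_fns):
        # One processing function per fused input.
        self.process_fns = process_fns
        self.input_watermarks = [float("-inf")] * len(process_fns)

    def process(self, input_index, record):
        return self.process_fns[input_index](record)

    def advance_watermark(self, input_index, wm):
        # The combined watermark is the minimum across all fused inputs.
        self.input_watermarks[input_index] = wm
        return min(self.input_watermarks)

op = MultipleInputOperator([lambda r: ("joined", r), lambda r: ("agg", r)])
print(op.process(0, 42))               # ('joined', 42)
print(op.advance_watermark(0, 10))     # -inf until every input advances
print(op.advance_watermark(1, 7))      # 7
```

Because the fused inputs live in one operator, records flow between the join, aggregation, and union stages without intermediate serialization or network hops.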
Long Sliding‑Window Optimization
Long sliding windows (e.g., 7-day or 30-day windows) suffer from high overhead due to frequent pane merges. By storing intermediate results in global state and adding a retractMerge method to aggregation functions, the system can retract expired panes and merge only newly completed ones, reducing pane processing by up to 66% and delivering a 60% CPU reduction.
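For an invertible aggregate such as SUM, the retract-and-merge idea looks roughly like the sketch below: keep the running aggregate in state, subtract the pane that fell out of the window, and add only the newly completed pane, instead of re-merging every pane on each slide. The names here (retract_merge) are illustrative, not Flink's actual aggregation interface:

```python
# Sketch of the retractMerge idea for an invertible aggregate (SUM).
# Instead of re-merging all panes whenever the window slides, keep the
# running aggregate in state, retract the expired pane, and merge only
# the new one. Illustrative only, not Flink's real interface.

from collections import deque

class SlidingSumWindow:
    def __init__(self, panes_per_window):
        self.panes = deque()        # panes currently inside the window
        self.acc = 0                # running aggregate kept in state
        self.size = panes_per_window

    def retract_merge(self, new_pane_sum):
        # Merge the newly completed pane...
        self.panes.append(new_pane_sum)
        self.acc += new_pane_sum
        # ...and retract the expired one, rather than summing all panes.
        if len(self.panes) > self.size:
            self.acc -= self.panes.popleft()
        return self.acc

# A 7-day window sliding by 1 day has 7 panes per window.
w = SlidingSumWindow(panes_per_window=7)
daily = [5, 3, 8, 2, 7, 1, 4, 6]        # daily pane sums
results = [w.retract_merge(d) for d in daily]
print(results[6])   # sum of first 7 days = 30
print(results[7])   # window slides: 30 - 5 + 6 = 31
```

Each slide now touches two panes (one retracted, one merged) instead of all seven, which is where the large reduction in pane processing comes from on long windows.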
2. Data‑Format Optimizations
Native JSON Format
Approximately 13,000 tasks (≈70% of core usage) rely on JSON deserialization. The team switched to a vectorized C++ JSON parser (sonic-cpp) and generated BinaryRowData directly, bypassing generic serialization. Tests show a 57% CPU gain for native JSON handling.
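The gain comes from writing parsed fields straight into a fixed binary row layout instead of building an intermediate generic row object. A minimal sketch of that idea, with a made-up schema and layout (and Python's json standing in for a vectorized parser like sonic-cpp):

```python
# Illustrative sketch: write parsed JSON fields directly into a fixed
# binary row layout (in the spirit of producing BinaryRowData), skipping
# an intermediate generic row object. The schema and layout are made up
# for the example; json.loads stands in for a vectorized C++ parser.

import json
import struct

ROW_FORMAT = "<q d 16s"   # int64 id, float64 score, fixed 16-byte name

def json_to_binary_row(raw: bytes) -> bytes:
    obj = json.loads(raw)
    # Truncate/pad the string field to its fixed-width slot.
    name = obj["name"].encode("utf-8")[:16].ljust(16, b"\x00")
    return struct.pack(ROW_FORMAT, obj["id"], obj["score"], name)

row = json_to_binary_row(b'{"id": 7, "score": 0.5, "name": "alice"}')
print(len(row))                        # fixed-width row: 8 + 8 + 16 = 32
uid, score, name = struct.unpack(ROW_FORMAT, row)
print(uid, score, name.rstrip(b"\x00").decode())   # 7 0.5 alice
```

Because the binary row is the engine's native in-memory format, downstream operators can read fields by offset with no further deserialization step.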
3. Tooling, Optimization, and Platform Layers
The tool layer provides real‑time SQL task metadata reporting, operator‑level monitoring, DAG compatibility checks, prioritized gray‑scale rollout, and data‑accuracy tracing, ensuring stable and accurate deployment of optimizations.
The optimization layer scales proven improvements across existing jobs and explores new techniques, while the engine & platform layer collaborates with product teams to enable default activation of vetted optimizations, delivering over 100,000 core‑hour performance gains.
Future work will continue to refine FlinkSQL, explore optimal state usage in joins, and advance native engine and batch‑stream fusion.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.