Big Data 19 min read

Ten Gotchas When Migrating Spark Jobs to Flink

This article shares ten practical pitfalls encountered while moving hour‑level Spark session processing jobs to Apache Flink, covering parallelism skew, state TTL, checkpoint handling, logging, debugging, state migration, Reduce vs Process, input validation, event‑time handling, and the trade‑offs of storing data inside Flink.

DataFunTalk
DataFunTalk
DataFunTalk
Ten Gotchas When Migrating Spark Jobs to Flink

Robin from Contentsquare summarizes ten real‑world traps they hit when migrating Spark session‑processing jobs to Apache Flink, offering concrete advice for teams deploying Flink in production.

1. Parallelism causing load skew – Uneven key‑group distribution leads to some subtasks processing twice the data of others. The default maxParallelism (operatorParallelism + operatorParallelism/2) can exacerbate the issue; set maxParallelism to a multiple of the desired parallelism to balance load.

public static int computeOperatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
    return keyGroupId * parallelism / maxParallelism;
}

2. Importance of mapWithState & TTL – When dealing with unbounded keyed state, configure TTL to prevent unlimited growth. Using mapWithState hides state details and makes TTL configuration difficult, so prefer explicit state descriptors with TTL or RichMapFunction for better control.

3. Checkpoint restore and repartition – For large state (≈8 TB), use incremental checkpoints and retained checkpoints to avoid costly savepoints. Increase RocksDB checkpoint transfer threads (e.g., state.backend.rocksdb.checkpoint.transfer.thread.num: 8 ) to speed up releases.

state.backend.rocksdb.checkpoint.transfer.thread.num: 8
state.backend.rocksdb.thread.num: 8

4. Add proactive logging – Log long‑running windows (e.g., >1 min) to capture problematic data without overwhelming the system.

5. Identify what a stuck job is doing – Enable JMX on TaskManagers, forward port 1099, and use jconsole to inspect thread stacks, quickly pinpointing the blocking method.

env.java.opts: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099 -Djava.rmi.server.hostname=127.0.0.1"
kubectl port-forward flink-taskmanager-4 1099
jconsole 127.0.0.1:1099

6. Risks of migrating state between descriptors – Moving data from a window‑content state to a historical session state can cause OOM if old events are merged twice; use PurgingTrigger to clear window content before migration.

// Purging the window's content allows us to receive late events without merging them twice with the old session
val sessionWindows = keyedStream
  .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
  .allowedLateness(Time.days(7))
  .trigger(PurgingTrigger.of(EventTimeTrigger.create()))

7. Reduce vs Process – ReducingState aggregates on‑the‑fly with smaller state, while ListState stores all elements for later processing; Flink’s RocksDB can merge ListState without deserialization, making it faster for this use case.

8. Never trust input data – Filter invalid, duplicate, or skewed records early; enforce limits such as a maximum of 300 page‑view events per session.

9. Dangers of event‑time processing – Slow upstream partitions cause late data to be treated as late arrivals, increasing processing cost. Solutions include aligning upstream partitioning with downstream or batching late events using custom triggers.

10. Avoid storing everything in Flink – Keep only hot, frequently accessed data in Flink state; offload large, cold data to external stores (e.g., Aerospike) to reduce checkpoint size and improve fault‑tolerance.

Overall, the authors recommend joining the Flink community mailing list and note that Alibaba Cloud’s real‑time computing platform builds on Apache Flink for enterprise‑grade streaming analytics.

performanceBig DataFlinkState Managementstreamingcheckpointing
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.