Why Spark 3.2 OOMs After Upgrade: Deep Dive into AQE and StageMetrics
After upgrading Spark from 3.0.1 to 3.2.1, an ETL job began failing with OutOfMemory errors. This article examines the root causes, including AQE-related metric accumulation and the handling of skipped stages, and walks through the debugging process and a code-level fix that mitigates the memory pressure.
Problem Background
After upgrading Spark from 3.0.1 to 3.2.1, an ETL job started failing with OutOfMemory errors that did not occur before the upgrade.
Driver logs showed occasional OutOfMemory exceptions.
Problem Analysis
OOM issues are typically investigated by analyzing heap dumps or monitoring memory usage over time; the goal is to identify which objects consume the most memory and to reproduce the problem reliably.
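As a concrete starting point, a heap dump can be captured from inside the driver JVM itself, equivalent to running jmap -dump against the driver process. This is a sketch using the standard HotSpot diagnostic MBean; the output path is illustrative:

```scala
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

object HeapDump {
  // Writes a .hprof heap dump to `path` (the file must not already exist,
  // and recent JDKs require the .hprof extension). With liveOnly = true the
  // JVM runs a GC first and dumps only reachable objects.
  def dump(path: String, liveOnly: Boolean = true): Unit = {
    val bean = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
    bean.dumpHeap(path, liveOnly)
  }
}
```

The resulting file can then be opened in a heap analyzer to rank objects by retained size.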
Investigation Process
Issue Confirmation
Running the problematic ETL alone confirmed memory growth leading to service unavailability, consistent with previous memory monitoring.
The ETL scripts were unchanged across the upgrade, and rolling back to Spark 3.0.1 allowed the ETL to run successfully, confirming the Spark upgrade itself as the trigger.
Hypotheses
1. Whether increasing spark.scheduler.listenerbus.eventqueue.capacity had caused extra memory pressure. Driver logs showed no dropped events, so this was ruled out.
2. Changes in Spark 3.2 code logic. Profiling the heap dump with JProfiler revealed that SQLAppStatusListener held large amounts of memory, especially the LiveStageMetrics objects stored in its stageMetrics ConcurrentHashMap.
This matched a known change, [SPARK-33016][SQL] "Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on", whose fix deliberately trades memory for UI correctness. As its description puts it: "So decided to make a trade off of keeping more duplicate SQLMetrics without deleting them when AQE with newPlan updated."
Reproduction Attempts
Running the ETL with empty data locally did not reproduce the issue, indicating the problem was data dependent. Observing the original job revealed several characteristics:
Many jobs with dependencies.
Final job contains tens of thousands of tasks.
ETL includes many join and union operations.
Increasing data volume and partition count reproduced frequent GC and near-OOM states, with heap dumps showing SQLAppStatusListener dominating memory usage.
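A rough back-of-envelope model shows why this workload shape multiplies metric entries. All numbers here are illustrative assumptions, not values measured from the job in question:

```scala
// Back-of-envelope model: each join/union contributes shuffle stages, each
// stage tracks a set of SQL metrics, and each AQE plan update can register
// those metrics again. Every parameter below is an assumed, illustrative value.
object MetricGrowthEstimate {
  def estimatedEntries(joinsAndUnions: Int,
                       metricsPerStage: Int,
                       planUpdatesPerStage: Int): Int = {
    val shuffleStages = joinsAndUnions * 2 // assume ~2 shuffle stages per operator
    shuffleStages * metricsPerStage * planUpdatesPerStage
  }

  def main(args: Array[String]): Unit = {
    // 30 joins/unions, ~20 metrics per stage, plan updated twice under AQE:
    println(estimatedEntries(30, 20, 2)) // prints 2400
  }
}
```

Under these assumptions, doubling the number of plan updates per stage doubles the tracked entries, which is consistent with the roughly twofold growth observed below.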
Further Analysis
Comparing Spark 3.0.1 and 3.2.1 revealed that the newer version generated roughly twice as many StageMetrics entries and many skipped stages, increasing memory consumption.
Metrics are added by SparkListenerSQLAdaptiveExecutionUpdate and SparkListenerSQLAdaptiveSQLMetricUpdates events. The newer version triggers these events more often and with more metrics per event.
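The accumulation mechanism can be illustrated with a simplified, self-contained model. These are stand-in classes, not Spark's actual SQLAppStatusListener, but they capture the SPARK-33016 trade-off described above:

```scala
import scala.collection.mutable

// Stand-in for the listener's metric registry: every AQE plan update registers
// the new plan's metrics under freshly created accumulator ids, and entries
// from earlier plan versions are kept rather than deleted.
class AqeMetricAccumulation {
  private var nextAccumId = 0L
  private val metricTypes = mutable.LinkedHashMap[Long, String]()

  // Models one adaptive-execution-update event: the re-optimized plan carries
  // brand-new accumulators, so no ids collide with previously stored ones.
  def onAdaptiveExecutionUpdate(metricsInPlan: Int): Unit =
    (0 until metricsInPlan).foreach { _ =>
      metricTypes(nextAccumId) = "sum"
      nextAccumId += 1
    }

  def trackedMetricCount: Int = metricTypes.size
}

object AqeMetricDemo {
  def main(args: Array[String]): Unit = {
    val listener = new AqeMetricAccumulation
    // One initial plan plus two AQE re-plans, 10 metrics each:
    (1 to 3).foreach(_ => listener.onAdaptiveExecutionUpdate(10))
    println(listener.trackedMetricCount) // 30: duplicates are retained
  }
}
```

The more often AQE re-optimizes the plan, the more entries accumulate, which is why a version that fires these events more frequently holds more memory for the same job.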
Code inspection shows that StageMetrics entries are initialized on SparkListenerJobStart and updated in onStageSubmitted. Skipped stages never reach onStageSubmitted, so their entries are created but never cleaned up, inflating the stageMetrics map:
```scala
private val stageMetrics = new ConcurrentHashMap[Int, LiveStageMetrics]()

override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
  if (!isSQLStage(event.stageInfo.stageId)) {
    return
  }
  Option(stageMetrics.get(event.stageInfo.stageId)).foreach { stage =>
    if (stage.attemptId != event.stageInfo.attemptNumber) {
      stageMetrics.put(event.stageInfo.stageId,
        new LiveStageMetrics(event.stageInfo.stageId, event.stageInfo.attemptNumber,
          stage.numTasks, stage.accumIdsToMetricType))
    }
  }
}
```

Removing skipped stages' entries at JobEnd reduced memory pressure, allowing the ETL to complete without frequent full GC.
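The shape of that cleanup can be sketched with simplified stand-in types (not Spark's listener API): entries exist for every stage from job start, onStageSubmitted fires only for stages that actually run, and job end drops the leftovers:

```scala
import java.util.concurrent.ConcurrentHashMap

// Minimal stand-in for LiveStageMetrics; the real class also tracks task
// counts and accumulator-id-to-metric-type mappings.
final case class LiveStageMetricsStub(stageId: Int, attemptId: Int)

class SkippedStageCleaner {
  val stageMetrics = new ConcurrentHashMap[Int, LiveStageMetricsStub]()
  private val submitted = scala.collection.mutable.Set[Int]()

  // Job start creates an entry for every stage in the job, skipped or not.
  def onJobStart(stageIds: Seq[Int]): Unit =
    stageIds.foreach(id => stageMetrics.put(id, LiveStageMetricsStub(id, 0)))

  // Only stages that actually run are ever submitted.
  def onStageSubmitted(stageId: Int): Unit = submitted += stageId

  // The mitigation: at job end, drop entries for stages that were never
  // submitted (i.e. skipped), so the map stops growing with skipped stages.
  def onJobEnd(stageIds: Seq[Int]): Unit =
    stageIds.filterNot(submitted.contains).foreach(id => stageMetrics.remove(id))
}
```

The design choice here is to clean up eagerly per job rather than waiting for the whole SQL execution to end, which bounds the map's size even for long executions with many skipped stages.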
Conclusion
The OOM was linked to AQE‑related metric accumulation and the handling of skipped stages. Disabling AQE is not ideal, but cleaning up skipped stage data mitigates memory pressure. Ongoing work will continue to investigate AQE corner cases to ensure stability of enterprise‑grade Spark services.
GuanYuan Data Tech Team