Big Data 13 min read

Why Spark 3.2 OOMs After Upgrade: Deep Dive into AQE and StageMetrics

After upgrading Spark from 3.0.1 to 3.2.1 an ETL job began failing with OutOfMemory errors; this article examines the root causes, including AQE‑related metric accumulation, skipped stages, and stage‑metric growth, and presents a debugging process and a code‑level fix to mitigate memory pressure.

GuanYuan Data Tech Team
GuanYuan Data Tech Team
GuanYuan Data Tech Team
Why Spark 3.2 OOMs After Upgrade: Deep Dive into AQE and StageMetrics

Problem Background

After upgrading Spark from 3.0.1 to 3.2.1, an ETL job started failing with OutOfMemory errors that did not occur before the upgrade.

Driver logs showed occasional OutOfMemory exceptions.

Problem Analysis

OOM issues can be investigated by analyzing dump files or monitoring memory usage. The focus is on identifying objects consuming large memory and reproducing the issue.

Investigation Process

Issue Confirmation

Running the problematic ETL alone confirmed memory growth leading to service unavailability, consistent with previous memory monitoring.

Comparing scripts before and after the upgrade showed no changes, so rolling back Spark version allowed the ETL to run successfully.

Hypotheses

1. Whether increasing

Spark.scheduler.listenerbus.eventqueue.capacity

caused extra memory pressure. Logs showed no dropped events, so this was excluded.

2. Changes in Spark 3.2 code logic. Using JProfile on the dump revealed that

SQLAppStatusListener

held large amounts of memory, especially the

stageMetric

objects stored in a

ConcurrentHashMap

.

[SPARK-33016][SQL] Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on
So decided to make a trade off of keeping more duplicate SQLMetrics without deleting them when AQE with newPlan updated.

Reproduction Attempts

Running the ETL with empty data locally did not reproduce the issue, indicating data dependence. Observations of the original job showed many jobs with dependencies, a large final job with thousands of tasks, and many joins/unions.

Many jobs with dependencies.

Final job contains tens of thousands of tasks.

ETL includes many join and union operations.

Increasing data volume and partition count reproduced frequent GC and near‑OOM states, with heap dumps showing

SQLAppStatusListener

dominating memory usage.

Further Analysis

Comparing Spark 3.0.1 and 3.2.1 revealed that the newer version generated roughly twice as many

StageMetrics

and many skipped stages, increasing memory consumption.

Metrics are added by

SparkListenerSQLAdaptiveExecutionUpdate

and

SparkListenerSQLAdaptiveSQLMetricUpdates

. The newer version triggers these events more often and with more metrics.

Code inspection shows that

StageMetric

initialization occurs on

SparkListenerJobStart

and updates on

onStageSubmitted

. Skipped stages inflate the

stageMetrics

map.

<code>private val stageMetrics = new ConcurrentHashMap[Int, LiveStageMetrics]()

override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
  if (!isSQLStage(event.stageInfo.stageId)) {
    return
  }
  Option(stageMetrics.get(event.stageInfo.stageId)).foreach { stage =>
    if (stage.attemptId != event.stageInfo.attemptNumber) {
      stageMetrics.put(event.stageInfo.stageId,
        new LiveStageMetrics(event.stageInfo.stageId, event.stageInfo.attemptNumber,
          stage.numTasks, stage.accumIdsToMetricType))
    }
  }
}
</code>

Removing skipped stages at

JobEnd

reduced memory pressure, allowing the ETL to complete without frequent full GC.

Conclusion

The OOM was linked to AQE‑related metric accumulation and the handling of skipped stages. Disabling AQE is not ideal, but cleaning up skipped stage data mitigates memory pressure. Ongoing work will continue to investigate AQE corner cases to ensure stability of enterprise‑grade Spark services.

big dataSparkOutOfMemoryAQEStageMetrics
GuanYuan Data Tech Team
Written by

GuanYuan Data Tech Team

Practical insights from the GuanYuan Data Tech Team

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.