Why Spark 3.2 OOMs After Upgrade: Deep Dive into AQE and StageMetrics
After upgrading Spark from 3.0.1 to 3.2.1, an ETL job began failing with OutOfMemory errors. This article examines the root causes, including AQE-related metric accumulation and the handling of skipped stages, and walks through the debugging process and a code-level fix that mitigates the memory pressure.
Problem Background
After upgrading Spark from 3.0.1 to 3.2.1, an ETL job started failing with OutOfMemory errors that did not occur before the upgrade.
Driver logs showed occasional OutOfMemory exceptions.
Problem Analysis
OOM issues are typically investigated by analyzing heap dumps or monitoring memory usage over time; the goal is to identify which objects consume the most memory and to reproduce the problem reliably.
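As a concrete starting point, a heap dump can be captured from inside the driver JVM itself, equivalent to running jmap -dump against the driver process. This is a sketch using the standard HotSpot diagnostic MBean; the output path is illustrative:

```scala
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

object HeapDump {
  // Writes a .hprof heap dump to `path` (the file must not already exist,
  // and recent JDKs require the .hprof extension). With liveOnly = true the
  // JVM runs a GC first and dumps only reachable objects.
  def dump(path: String, liveOnly: Boolean = true): Unit = {
    val bean = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
    bean.dumpHeap(path, liveOnly)
  }
}
```

The resulting file can then be opened in a heap analyzer to rank objects by retained size.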
Investigation Process
Issue Confirmation
Running the problematic ETL alone confirmed memory growth leading to service unavailability, consistent with previous memory monitoring.
The ETL scripts were unchanged across the upgrade, and rolling back to Spark 3.0.1 allowed the ETL to run successfully, confirming the Spark upgrade itself as the trigger.
Hypotheses
1. Whether increasing spark.scheduler.listenerbus.eventqueue.capacity had caused extra memory pressure. Driver logs showed no dropped events, so this was ruled out.
2. Changes in Spark 3.2 code logic. Profiling the heap dump with JProfiler revealed that SQLAppStatusListener held large amounts of memory, especially the LiveStageMetrics objects stored in its stageMetrics ConcurrentHashMap.
This matched a known change, [SPARK-33016][SQL] "Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on", whose fix deliberately trades memory for UI correctness. As its description puts it: "So decided to make a trade off of keeping more duplicate SQLMetrics without deleting them when AQE with newPlan updated."
Reproduction Attempts
Running the ETL with empty data locally did not reproduce the issue, indicating the problem was data dependent. Observing the original job revealed several characteristics:
Many jobs with dependencies.
Final job contains tens of thousands of tasks.
ETL includes many join and union operations.
Increasing data volume and partition count reproduced frequent GC and near-OOM states, with heap dumps showing SQLAppStatusListener dominating memory usage.
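A rough back-of-envelope model shows why this workload shape multiplies metric entries. All numbers here are illustrative assumptions, not values measured from the job in question:

```scala
// Back-of-envelope model: each join/union contributes shuffle stages, each
// stage tracks a set of SQL metrics, and each AQE plan update can register
// those metrics again. Every parameter below is an assumed, illustrative value.
object MetricGrowthEstimate {
  def estimatedEntries(joinsAndUnions: Int,
                       metricsPerStage: Int,
                       planUpdatesPerStage: Int): Int = {
    val shuffleStages = joinsAndUnions * 2 // assume ~2 shuffle stages per operator
    shuffleStages * metricsPerStage * planUpdatesPerStage
  }

  def main(args: Array[String]): Unit = {
    // 30 joins/unions, ~20 metrics per stage, plan updated twice under AQE:
    println(estimatedEntries(30, 20, 2)) // prints 2400
  }
}
```

Under these assumptions, doubling the number of plan updates per stage doubles the tracked entries, which is consistent with the roughly twofold growth observed below.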
Further Analysis
Comparing Spark 3.0.1 and 3.2.1 revealed that the newer version generated roughly twice as many StageMetrics entries and many skipped stages, increasing memory consumption.
Metrics are added by SparkListenerSQLAdaptiveExecutionUpdate and SparkListenerSQLAdaptiveSQLMetricUpdates events. The newer version triggers these events more often and with more metrics per event.
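The accumulation mechanism can be illustrated with a simplified, self-contained model. These are stand-in classes, not Spark's actual SQLAppStatusListener, but they capture the SPARK-33016 trade-off described above:

```scala
import scala.collection.mutable

// Stand-in for the listener's metric registry: every AQE plan update registers
// the new plan's metrics under freshly created accumulator ids, and entries
// from earlier plan versions are kept rather than deleted.
class AqeMetricAccumulation {
  private var nextAccumId = 0L
  private val metricTypes = mutable.LinkedHashMap[Long, String]()

  // Models one adaptive-execution-update event: the re-optimized plan carries
  // brand-new accumulators, so no ids collide with previously stored ones.
  def onAdaptiveExecutionUpdate(metricsInPlan: Int): Unit =
    (0 until metricsInPlan).foreach { _ =>
      metricTypes(nextAccumId) = "sum"
      nextAccumId += 1
    }

  def trackedMetricCount: Int = metricTypes.size
}

object AqeMetricDemo {
  def main(args: Array[String]): Unit = {
    val listener = new AqeMetricAccumulation
    // One initial plan plus two AQE re-plans, 10 metrics each:
    (1 to 3).foreach(_ => listener.onAdaptiveExecutionUpdate(10))
    println(listener.trackedMetricCount) // 30: duplicates are retained
  }
}
```

The more often AQE re-optimizes the plan, the more entries accumulate, which is why a version that fires these events more frequently holds more memory for the same job.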
Code inspection shows that StageMetrics entries are initialized on SparkListenerJobStart and updated in onStageSubmitted. Skipped stages never reach onStageSubmitted, so their entries are created but never cleaned up, inflating the stageMetrics map:
```scala
private val stageMetrics = new ConcurrentHashMap[Int, LiveStageMetrics]()

override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
  if (!isSQLStage(event.stageInfo.stageId)) {
    return
  }
  Option(stageMetrics.get(event.stageInfo.stageId)).foreach { stage =>
    if (stage.attemptId != event.stageInfo.attemptNumber) {
      stageMetrics.put(event.stageInfo.stageId,
        new LiveStageMetrics(event.stageInfo.stageId, event.stageInfo.attemptNumber,
          stage.numTasks, stage.accumIdsToMetricType))
    }
  }
}
```

Removing skipped stages' entries at JobEnd reduced memory pressure, allowing the ETL to complete without frequent full GC.
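The shape of that cleanup can be sketched with simplified stand-in types (not Spark's listener API): entries exist for every stage from job start, onStageSubmitted fires only for stages that actually run, and job end drops the leftovers:

```scala
import java.util.concurrent.ConcurrentHashMap

// Minimal stand-in for LiveStageMetrics; the real class also tracks task
// counts and accumulator-id-to-metric-type mappings.
final case class LiveStageMetricsStub(stageId: Int, attemptId: Int)

class SkippedStageCleaner {
  val stageMetrics = new ConcurrentHashMap[Int, LiveStageMetricsStub]()
  private val submitted = scala.collection.mutable.Set[Int]()

  // Job start creates an entry for every stage in the job, skipped or not.
  def onJobStart(stageIds: Seq[Int]): Unit =
    stageIds.foreach(id => stageMetrics.put(id, LiveStageMetricsStub(id, 0)))

  // Only stages that actually run are ever submitted.
  def onStageSubmitted(stageId: Int): Unit = submitted += stageId

  // The mitigation: at job end, drop entries for stages that were never
  // submitted (i.e. skipped), so the map stops growing with skipped stages.
  def onJobEnd(stageIds: Seq[Int]): Unit =
    stageIds.filterNot(submitted.contains).foreach(id => stageMetrics.remove(id))
}
```

The design choice here is to clean up eagerly per job rather than waiting for the whole SQL execution to end, which bounds the map's size even for long executions with many skipped stages.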
Conclusion
The OOM was linked to AQE‑related metric accumulation and the handling of skipped stages. Disabling AQE is not ideal, but cleaning up skipped stage data mitigates memory pressure. Ongoing work will continue to investigate AQE corner cases to ensure stability of enterprise‑grade Spark services.
GuanYuan Data Tech Team