Why Your Spark Batch Job Fails: Memory Limits, Data Skew, and Practical Fixes
This article examines a recurring Spark batch task failure caused by OutOfMemory errors and data skew, details the investigation steps—including increasing executor memory, raising parallelism, and analyzing shuffle metrics—and proposes solutions such as data validation, filtering oversized keys, and memory adjustments.
1. Problem Description
Over the past two days, batch task A has failed repeatedly, with error logs as follows. The task computes association metrics for dimension A and involves aggregation operations.
2. Investigation
2.1 Increase Executor Memory
Based on the error "java.lang.OutOfMemoryError: Java heap space", the intuitive first guess was that executor memory was insufficient for the data volume, so executor memory was increased from 10 GB to 20 GB and the task was restarted to observe the effect.
To verify stability, the task was relaunched repeatedly; the success rate improved but failures still occurred, indicating that memory is a contributing factor rather than the root cause.
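For reference, the memory adjustment described above maps onto spark-submit options like the following (values and the class/jar names are illustrative, not this job's actual configuration; `spark.executor.memoryOverhead` covers off-heap usage such as shuffle buffers, which plain heap increases do not):

```
# Illustrative only: raise executor heap from 10 GB to 20 GB
# and give off-heap overhead explicit headroom.
spark-submit \
  --executor-memory 20g \
  --conf spark.executor.memoryOverhead=4g \
  --class com.example.TaskA \
  task-a.jar
```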
2.2 Increase Task Parallelism
To reduce the amount of data each task processes, the shuffle parallelism was increased and the effect observed.
Repeated launches still produced failures, suggesting the errors are caused by certain keys that accumulate excessively large amounts of data during aggregation.
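In Spark, raising parallelism for shuffle-heavy aggregations typically means adjusting settings like these (the value 800 is a hypothetical example; the SQL default is 200):

```
# Illustrative values: more shuffle partitions means less data per task.
spark.sql.shuffle.partitions   800
spark.default.parallelism      800
```

Note that higher parallelism only helps when data is spread across keys; a single oversized key still lands in a single task, which is why this step alone did not fix the failures.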
2.3 Check Data Skew
Examining the task-level metrics in the Spark UI shows a clear indicator:
In the Shuffle Read Size/Records column, record counts are similar across tasks but sizes differ dramatically (140 MB vs. 5 MB), indicating that some keys carry disproportionately large associated data.
Drilling into the secondary dimension B showed that a single value of dimension A is linked to 17,768,523 B records. This value comes from internal test data and can be filtered out.
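The same conclusion can be reached programmatically once per-key record counts are exported from the aggregation input. A minimal sketch in plain Python (the `key_counts` data and the 10x-median threshold are illustrative assumptions, not the job's real numbers):

```python
from statistics import median

def find_skewed_keys(key_counts, ratio=10):
    """Flag keys whose record count exceeds `ratio` times the median.

    key_counts: dict mapping key -> number of associated records.
    Returns flagged keys sorted by count, largest first.
    """
    if not key_counts:
        return []
    med = median(key_counts.values())
    skewed = [k for k, n in key_counts.items() if n > ratio * max(med, 1)]
    return sorted(skewed, key=key_counts.get, reverse=True)

# Illustrative data mirroring the incident: one test key dwarfs the rest.
counts = {"A_test": 17_768_523, "A_001": 1_200, "A_002": 950, "A_003": 1_010}
print(find_skewed_keys(counts))  # → ['A_test']
```

Running a check like this on a schedule surfaces skewed keys before they take down the batch job, instead of after.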
3. Resolution
(1) Add validity checks on incoming data.
(2) In the dimension-association statistics, validate per-key data sizes and filter out keys that exceed a reasonable threshold.
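Fix (2) can be sketched as a pre-aggregation filter: compute per-key counts first, then drop keys whose associated data exceeds a hard cap before the expensive aggregation runs. A plain-Python sketch (the cap value and record shape are assumptions, not the job's actual threshold):

```python
from collections import Counter

MAX_RECORDS_PER_KEY = 1_000_000  # assumed cap; tune to the real workload

def filter_oversized_keys(records, cap=MAX_RECORDS_PER_KEY):
    """Drop all records whose key is associated with more than `cap` records.

    records: iterable of (key, value) pairs.
    Returns (kept_records, dropped_keys) so dropped keys can be logged.
    """
    records = list(records)
    counts = Counter(key for key, _ in records)
    dropped = {k for k, n in counts.items() if n > cap}
    kept = [(k, v) for k, v in records if k not in dropped]
    return kept, dropped

# Illustrative usage with a tiny cap for demonstration.
data = [("A_test", i) for i in range(5)] + [("A_001", 0)]
kept, dropped = filter_oversized_keys(data, cap=3)
print(dropped)  # → {'A_test'}
```

Returning the dropped keys alongside the kept records matters in practice: silently discarding data makes the resulting metrics hard to audit.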
4. Recommendations
(1) Increasing memory can alleviate OutOfMemory errors, but it often masks rather than addresses the root cause.
(2) The ingestion interface does not strictly validate user-submitted data, which lets abnormal records through; introduce unified data-quality checks and cleaning before statistical analysis to avoid distorted results and wasted compute.
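Recommendation (2) can start as small as a reusable validator applied before records enter the statistics pipeline. A hedged sketch (the field name `dimension_a` and the rule of rejecting internal test prefixes are illustrative assumptions):

```python
def is_valid_record(record, test_prefixes=("test_", "internal_")):
    """Basic data-quality gate: reject records with a missing key or a
    key that marks internal/test data. Rules here are illustrative."""
    key = record.get("dimension_a")
    if not key:
        return False
    return not any(key.startswith(p) for p in test_prefixes)

rows = [
    {"dimension_a": "A_001", "value": 10},
    {"dimension_a": "test_A", "value": 99},   # internal test data
    {"dimension_a": "", "value": 5},          # missing key
]
clean = [r for r in rows if is_valid_record(r)]
print(len(clean))  # → 1
```

Centralizing rules like these in one function (or one shared library across jobs) is what makes the checks "unified" rather than ad-hoc per pipeline.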
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.