How EMR Serverless Storage Cuts Costs up to 55% for Shuffle‑Heavy Spark Jobs
A performance comparison of Amazon EMR Serverless Storage on a 3 TB TPC‑DS benchmark shows up to 55 % cost reduction and 25 % faster runtimes for shuffle‑intensive Spark jobs, while outlining usage limits and providing Python tools to analyze shuffle data from Spark event logs.
Spark jobs need temporary storage for shuffle data. Dynamic Resource Allocation (DRA) relies on an external shuffle service to release executors safely; where that service is unavailable, as on Kubernetes or EMR Serverless, executors that still hold shuffle data cannot be released promptly, so resources sit idle.
Amazon EMR Serverless Storage, available from EMR release 7.12, decouples shuffle data from executors by holding it in remote storage, which lets DRA work more efficiently and release resources faster.
Benchmark Setup
We evaluated Amazon EMR Serverless Storage using the TPC‑DS 3 TB benchmark (105 SQL queries). Two EMR Serverless applications were run: one with Serverless Storage enabled and one without. Environment details:
EMR Serverless version: 7.12.0 (arm64, us-east-1)
Dataset: TPCDS‑3TB
Driver: 4 Cores, 4 GiB memory
Executor: 4 Cores, 8 GiB memory, spark.dynamicAllocation.initialExecutors = 3
Storage: the non‑Serverless application used the default 20 GB per executor (free); the Serverless Storage application required no explicit storage configuration.
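As a rough illustration, the driver and executor sizing listed above could be expressed as Spark properties and rendered into a --conf string of the kind passed as Spark submit parameters. This is a sketch, not the benchmark's actual job definition: the helper name build_spark_submit_parameters is hypothetical, and mapping "4 GiB" and "8 GiB" to spark.driver.memory=4g and spark.executor.memory=8g is an assumption. The property that enables Serverless Storage is deliberately left out because the source does not name it.

```python
def build_spark_submit_parameters(conf: dict[str, str]) -> str:
    """Render Spark properties as a single string of --conf key=value flags."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


# Sizing taken from the environment list above; the memory mapping is an assumption.
BENCHMARK_CONF = {
    "spark.driver.cores": "4",
    "spark.driver.memory": "4g",
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    "spark.dynamicAllocation.initialExecutors": "3",
}

if __name__ == "__main__":
    print(build_spark_submit_parameters(BENCHMARK_CONF))
```

Keeping the properties in a dictionary makes it easy to run the same sizing against both applications and vary only the storage setting.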
Results
Overall cost reduction of 15.5 % with comparable runtime.
For the 20 queries whose shuffle data ranged from 10 GB to 100 GB, average cost saving was 13.32 % and runtime decreased by 6.5 %.
For the 3 queries with 100 GB–200 GB shuffle data, average cost saving was 55.16 % and runtime decreased by 25.35 %.
Queries with less than 10 GB of shuffle data showed no clear cost or performance advantage.
These results indicate that EMR Serverless Storage delivers cost and performance benefits when shuffle data exceeds 10 GB, with larger gains for shuffle‑intensive workloads.
Limitations
As of 2025‑12‑12, Serverless Storage is supported only on EMR Serverless (EMR 7.12+); it is not available on EMR on EC2 or EMR on EKS.
Each job can store a maximum of 200 GB of intermediate results; jobs exceeding this limit fail.
Worker configurations of 1 or 2 vCPU are not supported.
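The benchmark tiers and the limits above can be combined into a simple pre-check. The following is a minimal sketch under stated assumptions: the function name assess_serverless_storage is hypothetical, the handling of the exact 10 GB and 100 GB boundaries is a guess, and a job's total shuffle volume is treated as an approximation of the intermediate data it stores.

```python
def assess_serverless_storage(shuffle_gb: float, worker_vcpu: int) -> str:
    """Classify a job against the limits and benchmark tiers described above."""
    # Limitation: 1 and 2 vCPU worker configurations are not supported.
    if worker_vcpu in (1, 2):
        return "unsupported: 1 or 2 vCPU workers"
    # Limitation: at most 200 GB of intermediate results per job.
    if shuffle_gb > 200:
        return "unsupported: exceeds the 200 GB per-job limit"
    # Benchmark tier: 100-200 GB showed the largest savings.
    if shuffle_gb >= 100:
        return "strong candidate: largest savings observed in this range"
    # Benchmark tier: 10-100 GB showed moderate savings.
    if shuffle_gb >= 10:
        return "candidate: moderate savings observed in this range"
    # Below 10 GB the benchmark showed no clear advantage.
    return "no clear advantage observed below 10 GB"


if __name__ == "__main__":
    for size_gb in (5, 50, 150, 250):
        print(size_gb, "GB ->", assess_serverless_storage(size_gb, worker_vcpu=4))
```

The thresholds reflect a single 3 TB TPC-DS run, so the output is better read as a hint about where to measure than as a guarantee.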
Shuffle Data Extraction Tool
Shuffle size can be obtained by parsing Spark event logs stored in S3. Example configuration for EMR Serverless:
"s3MonitoringConfiguration": {"logUri": "s3://xxxx/logs/spark-event-log"}
A Python script analyze_spark_shuffle.py parses the event logs and aggregates shuffle read/write bytes and records. A simplified excerpt of the script:
#!/usr/bin/env python3
"""Spark Event Log Shuffle Analyzer"""
import argparse, json, boto3, csv, logging
# ... (argument parsing, S3 listing, log parsing) ...
The script can be executed in parallel using the --threads option, e.g.:
uv run python analyze_spark_shuffle.py \
--event-log-base-path s3://xxxxx/spark-event-log/ \
--application-id xxxxx \
--job-ids xxxx,xxxx \
--threads 5
MCP Integration
An MCP definition allows the analyzer to be invoked from tools such as Kiro CLI, producing an HTML report:
{
  "mcpServers": {
    "spark-eventlog": {
      "type": "stdio",
      "command": "uvx",
      "args": ["--from", "git+https://github.com/yhyyz/spark-eventlog-mcp", "spark-eventlog-mcp"],
      "env": {"MCP_TRANSPORT": "stdio"}
    }
  }
}
Using this MCP, users can obtain a comprehensive analysis of any Spark event log.
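For readers who want to see what the shuffle aggregation involves, here is a minimal sketch of the core step. It is not the code of analyze_spark_shuffle.py or of the MCP server. It assumes the standard Spark event-log layout, with one JSON object per line and SparkListenerTaskEnd events carrying "Shuffle Read Metrics" and "Shuffle Write Metrics". The function name aggregate_shuffle and the per-stage grouping are illustrative choices, and S3 listing and download are omitted.

```python
import json
from collections import defaultdict
from typing import Iterable


def aggregate_shuffle(lines: Iterable[str]) -> dict[int, dict[str, int]]:
    """Sum shuffle read/write bytes and records per stage from event-log lines."""
    totals: dict[int, dict[str, int]] = defaultdict(
        lambda: {"read_bytes": 0, "read_records": 0, "write_bytes": 0, "write_records": 0}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict) or event.get("Event") != "SparkListenerTaskEnd":
            continue
        # Failed tasks may carry no metrics, so fall back to empty dictionaries.
        metrics = event.get("Task Metrics") or {}
        read = metrics.get("Shuffle Read Metrics") or {}
        write = metrics.get("Shuffle Write Metrics") or {}
        stage = totals[event.get("Stage ID", -1)]
        stage["read_bytes"] += read.get("Remote Bytes Read", 0) + read.get("Local Bytes Read", 0)
        stage["read_records"] += read.get("Total Records Read", 0)
        stage["write_bytes"] += write.get("Shuffle Bytes Written", 0)
        stage["write_records"] += write.get("Shuffle Records Written", 0)
    return dict(totals)


def total_shuffle_write_gb(per_stage: dict[int, dict[str, int]]) -> float:
    """Total shuffle bytes written across all stages, in decimal gigabytes."""
    return sum(stage["write_bytes"] for stage in per_stage.values()) / 1e9
```

With a locally downloaded, uncompressed log, usage might look like `with open(path) as f: print(total_shuffle_write_gb(aggregate_shuffle(f)))`, where path is a placeholder for the file location.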
Summary
EMR Serverless Storage is best suited for shuffle‑heavy Spark jobs (>10 GB), with the most pronounced savings for workloads exceeding 100 GB of shuffle data. For smaller shuffle volumes, traditional storage may be more economical. The provided Python script and MCP enable rapid assessment of shuffle size and migration suitability.
Amazon Cloud Developers
Official technical community of Amazon Cloud. Shares practical AI/ML, big data, database, modern app development, IoT content, offers comprehensive learning resources, hosts regular developer events, and continuously empowers developers.
