Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)
This article examines the critical role of shuffle in big‑data processing, compares Spark's native shuffle with the External Shuffle Service (ESS) and Remote Shuffle Service (RSS) solutions, introduces Uniffle's architecture and configuration, and shares practical deployment experiences and performance results from 360's internal environment.
Shuffle is a core operator in big‑data computation, directly affecting the efficiency and stability of frameworks like Spark. The article first outlines the limitations of Spark's native shuffle, where Executors manage both map output and reduce fetch, leading to potential timeouts and data loss.
To address these issues, Spark introduced the External Shuffle Service (ESS), especially when integrated with YARN. ESS runs as an auxiliary service on each NodeManager, storing shuffle data on local disks and allowing reducers to fetch data directly, improving stability and scalability. However, ESS suffers from problems such as storage‑compute coupling, IO amplification, high connection counts, and single‑point‑of‑failure risks, which are exacerbated in cloud environments with limited or no local disks.
As a remedy, the community developed Remote Shuffle Service (RSS) to decouple shuffle data from NodeManagers. Several mature RSS implementations exist, including Uber's Zeus, ByteDance's CloudShuffleService, Tencent's Uniffle, and Alibaba's Celeborn. The article focuses on Uniffle, adopted by 360 due to its active community and dual support for Spark and MapReduce.
Uniffle's architecture consists of a Coordinator, Shuffle Servers, and a remote storage system (e.g., HDFS). The shuffle workflow involves the driver requesting a shuffle server, tasks writing key‑value pairs to buffers, flushing to queues, and finally persisting data to storage. Metadata is sent back for verification, and tasks can later read data from shuffle servers or HDFS.
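The write path above (task buffers key‑value pairs, buffers are flushed to a queue, and blocks are finally persisted with metadata reported back for verification) can be sketched as a toy simulation. This is a minimal illustration with hypothetical names, not Uniffle's actual Java implementation:

```python
from collections import defaultdict

# Toy model of the Uniffle shuffle-server write path described above.
# Records accumulate in per-partition buffers, flush to a queue once a
# size threshold is reached, and are then "persisted" to storage; block
# ids are kept as metadata so readers can verify completeness later.
class ShuffleServerSketch:
    def __init__(self, buffer_limit=3):
        self.buffer_limit = buffer_limit
        self.buffers = defaultdict(list)   # partition -> in-memory records
        self.flush_queue = []              # blocks awaiting persistence
        self.storage = defaultdict(list)   # partition -> persisted blocks
        self.metadata = []                 # block ids reported back to tasks
        self._next_block_id = 0

    def send(self, partition, key, value):
        """A task sends one key-value pair; flush when the buffer fills."""
        self.buffers[partition].append((key, value))
        if len(self.buffers[partition]) >= self.buffer_limit:
            self._flush(partition)

    def _flush(self, partition):
        block = (self._next_block_id, partition, self.buffers.pop(partition))
        self._next_block_id += 1
        self.flush_queue.append(block)

    def persist_all(self):
        """Flush remaining partial buffers, then drain the queue to
        storage (standing in for local disk or HDFS)."""
        for partition in list(self.buffers):
            self._flush(partition)
        while self.flush_queue:
            block_id, partition, records = self.flush_queue.pop(0)
            self.storage[partition].append(records)
            self.metadata.append(block_id)

    def read(self, partition):
        """A reducer fetches all persisted records for one partition."""
        return [kv for block in self.storage[partition] for kv in block]

# Tiny demo: two partitions, buffer flushes after two records.
server = ShuffleServerSketch(buffer_limit=2)
server.send(0, "a", 1)
server.send(0, "b", 2)
server.send(1, "c", 3)
server.persist_all()
```

The real service adds failure handling, replication, and concurrent flush threads; the point here is only the buffer → flush queue → storage → metadata sequence.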
Implementation details include configuration files for both MapReduce and Spark. Example configurations are shown below:
mapreduce.rss.control.enabled=true
mapreduce.rss.qilin.url=THE_URL_OF_QILIN_API
mapreduce.rss.coordinator.quorum=COORDINATOR_IP_1:<19999>,COORDINATOR_IP_2:<19999>
mapreduce.rss.other_keys=other_values
yarn.app.mapreduce.am.command-opts=org.apache.hadoop.mapreduce.v2.app.RssMRAppMaster
mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.RssMapOutputCollector
mapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.RssShuffle
spark.rss.control.enabled=true
spark.rss.qilin.url=THE_URL_OF_QILIN_API
spark.rss.coordinator.quorum=COORDINATOR_IP_1:<19999>,COORDINATOR_IP_2:<19999>
spark.rss.other_keys=other_values
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager

Whether to enable RSS for a given job is decided via a custom API. The request body sent by Spark jobs looks like:
{
"appId": "application_1690548620601_377655",
"queue": "test_queue",
"appType": "SPARK",
"userName": "test_user",
"appTags": ["k1:v1","k2:v2","subClusterId:yarn_bj",...,"kn:vn"],
"appName": "my_test_job_name"
}

The corresponding response indicates whether RSS should be enabled and which cluster to use:
{
"rssEnable": true,
"destClusterId": "yarn_bj_online"
}

Based on the response, the client adjusts its submission parameters, and the YARN Router routes the job to the appropriate cluster with RSS enabled or disabled. Similar logic is applied to MapReduce jobs by modifying the job‑submitter configuration.
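The client-side decision step can be sketched as a pure function over the response JSON shown above. The function name is hypothetical, and the HTTP call itself is omitted (in practice the submitter POSTs the request body to the Qilin API and parses the reply); `spark.yarn.tags` is used here as an assumed routing hint for the YARN Router:

```python
# Sketch of how a submitter could translate the Qilin API response into
# Spark submission parameters. Field names mirror the JSON in the article;
# apply_rss_decision is a hypothetical helper, not 360's actual code.
def apply_rss_decision(spark_conf: dict, response: dict) -> dict:
    conf = dict(spark_conf)
    if response.get("rssEnable"):
        conf["spark.rss.control.enabled"] = "true"
        # RssShuffleManager replaces Spark's default sort-based shuffle.
        conf["spark.shuffle.manager"] = (
            "org.apache.spark.shuffle.RssShuffleManager"
        )
        # Assumed tag consumed by the YARN Router to pick the target cluster.
        conf["spark.yarn.tags"] = "subClusterId:" + response["destClusterId"]
    else:
        conf["spark.rss.control.enabled"] = "false"
    return conf

conf = apply_rss_decision(
    {}, {"rssEnable": True, "destClusterId": "yarn_bj_online"}
)
```

Keeping the decision in one pure function makes it easy to apply the same logic to MapReduce jobs by emitting `mapreduce.rss.*` keys instead.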
Deployment results across three baseline clusters show that RSS significantly reduces job failure rates and execution time: a Spark job run for 10 iterations now completes stably in about 60% of its original duration, and shuffle data is effectively decoupled from NodeManager disks.
In conclusion, both Uniffle and Celeborn meet 360's functional requirements, but Uniffle currently offers better alignment with the company's needs, especially regarding support for external storage and flexibility in handling large‑scale shuffle workloads.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.