Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)
This article examines the critical role of shuffle in big‑data processing, compares Spark's native shuffle with the External Shuffle Service (ESS) and Remote Shuffle Service (RSS) solutions, introduces Uniffle's architecture and configuration, and shares practical deployment experiences and performance results from 360's internal environment.
Shuffle is a core operator in big‑data computation, directly affecting the efficiency and stability of frameworks like Spark. The article first outlines the limitations of Spark's native shuffle, where Executors manage both map output and reduce fetch, leading to potential timeouts and data loss.
To address these issues, Spark introduced the External Shuffle Service (ESS), especially when integrated with YARN. ESS runs as an auxiliary service on each NodeManager, storing shuffle data on local disks and allowing reducers to fetch data directly, improving stability and scalability. However, ESS suffers from problems such as storage‑compute coupling, IO amplification, high connection counts, and single‑point‑of‑failure risks, which are exacerbated in cloud environments with limited or no local disks.
As a remedy, the community developed Remote Shuffle Service (RSS) to decouple shuffle data from NodeManagers. Several mature RSS implementations exist, including Uber's Zeus, ByteDance's CloudShuffleService, Tencent's Uniffle, and Alibaba's Celeborn. The article focuses on Uniffle, adopted by 360 due to its active community and dual support for Spark and MapReduce.
Uniffle's architecture consists of a Coordinator, Shuffle Servers, and a remote storage system (e.g., HDFS). The shuffle workflow involves the driver requesting a shuffle server, tasks writing key‑value pairs to buffers, flushing to queues, and finally persisting data to storage. Metadata is sent back for verification, and tasks can later read data from shuffle servers or HDFS.
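The write path above (task buffers key‑value pairs, buffers are flushed to a queue, and blocks are finally persisted with metadata reported back for verification) can be sketched as a toy simulation. This is a minimal illustration with hypothetical names, not Uniffle's actual Java implementation:

```python
from collections import defaultdict

# Toy model of the Uniffle shuffle-server write path described above.
# Records accumulate in per-partition buffers, flush to a queue once a
# size threshold is reached, and are then "persisted" to storage; block
# ids are kept as metadata so readers can verify completeness later.
class ShuffleServerSketch:
    def __init__(self, buffer_limit=3):
        self.buffer_limit = buffer_limit
        self.buffers = defaultdict(list)   # partition -> in-memory records
        self.flush_queue = []              # blocks awaiting persistence
        self.storage = defaultdict(list)   # partition -> persisted blocks
        self.metadata = []                 # block ids reported back to tasks
        self._next_block_id = 0

    def send(self, partition, key, value):
        """A task sends one key-value pair; flush when the buffer fills."""
        self.buffers[partition].append((key, value))
        if len(self.buffers[partition]) >= self.buffer_limit:
            self._flush(partition)

    def _flush(self, partition):
        block = (self._next_block_id, partition, self.buffers.pop(partition))
        self._next_block_id += 1
        self.flush_queue.append(block)

    def persist_all(self):
        """Flush remaining partial buffers, then drain the queue to
        storage (standing in for local disk or HDFS)."""
        for partition in list(self.buffers):
            self._flush(partition)
        while self.flush_queue:
            block_id, partition, records = self.flush_queue.pop(0)
            self.storage[partition].append(records)
            self.metadata.append(block_id)

    def read(self, partition):
        """A reducer fetches all persisted records for one partition."""
        return [kv for block in self.storage[partition] for kv in block]

# Tiny demo: two partitions, buffer flushes after two records.
server = ShuffleServerSketch(buffer_limit=2)
server.send(0, "a", 1)
server.send(0, "b", 2)
server.send(1, "c", 3)
server.persist_all()
```

The real service adds failure handling, replication, and concurrent flush threads; the point here is only the buffer → flush queue → storage → metadata sequence.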
Implementation details include configuration files for both MapReduce and Spark. Example configurations are shown below:
mapreduce.rss.control.enabled=true
mapreduce.rss.qilin.url=THE_URL_OF_QILIN_API
mapreduce.rss.coordinator.quorum=COORDINATOR_IP_1:<19999>,COORDINATOR_IP_2:<19999>
mapreduce.rss.other_keys=other_values
yarn.app.mapreduce.am.command-opts=org.apache.hadoop.mapreduce.v2.app.RssMRAppMaster
mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.RssMapOutputCollector
mapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.RssShuffle
spark.rss.control.enabled=true
spark.rss.qilin.url=THE_URL_OF_QILIN_API
spark.rss.coordinator.quorum=COORDINATOR_IP_1:<19999>,COORDINATOR_IP_2:<19999>
spark.rss.other_keys=other_values
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager

Whether to enable RSS for a given job is decided via a custom API. The request body sent by Spark jobs looks like:
{
"appId": "application_1690548620601_377655",
"queue": "test_queue",
"appType": "SPARK",
"userName": "test_user",
"appTags": ["k1:v1","k2:v2","subClusterId:yarn_bj",...,"kn:vn"],
"appName": "my_test_job_name"
}

The corresponding response indicates whether RSS should be enabled and which cluster to use:
{
"rssEnable": true,
"destClusterId": "yarn_bj_online"
}

Based on the response, the client adjusts its submission parameters, and the YARN Router routes the job to the appropriate cluster with RSS enabled or disabled. Similar logic is applied to MapReduce jobs by modifying the job‑submitter configuration.
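The client-side decision step can be sketched as a pure function over the response JSON shown above. The function name is hypothetical, and the HTTP call itself is omitted (in practice the submitter POSTs the request body to the Qilin API and parses the reply); `spark.yarn.tags` is used here as an assumed routing hint for the YARN Router:

```python
# Sketch of how a submitter could translate the Qilin API response into
# Spark submission parameters. Field names mirror the JSON in the article;
# apply_rss_decision is a hypothetical helper, not 360's actual code.
def apply_rss_decision(spark_conf: dict, response: dict) -> dict:
    conf = dict(spark_conf)
    if response.get("rssEnable"):
        conf["spark.rss.control.enabled"] = "true"
        # RssShuffleManager replaces Spark's default sort-based shuffle.
        conf["spark.shuffle.manager"] = (
            "org.apache.spark.shuffle.RssShuffleManager"
        )
        # Assumed tag consumed by the YARN Router to pick the target cluster.
        conf["spark.yarn.tags"] = "subClusterId:" + response["destClusterId"]
    else:
        conf["spark.rss.control.enabled"] = "false"
    return conf

conf = apply_rss_decision(
    {}, {"rssEnable": True, "destClusterId": "yarn_bj_online"}
)
```

Keeping the decision in one pure function makes it easy to apply the same logic to MapReduce jobs by emitting `mapreduce.rss.*` keys instead.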
Deployment results across three baseline clusters show that RSS significantly reduces job failure rates and execution time: a Spark job run for 10 iterations now completes stably in about 60% of its original duration, and shuffle data is effectively decoupled from NodeManager disks.
In conclusion, both Uniffle and Celeborn meet 360's functional requirements, but Uniffle currently offers better alignment with the company's needs, especially regarding support for external storage and flexibility in handling large‑scale shuffle workloads.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.