
How ByteDance’s Cloud Shuffle Service Boosts Big Data Job Stability and Performance

ByteDance’s Cloud Shuffle Service (CSS) replaces the traditional Pull‑Based Sort Shuffle in Spark, FlinkBatch and MapReduce with a Push‑Based remote shuffle that improves stability, performance and elasticity, supports compute‑storage separation, and delivers significant speedups in large‑scale TPC‑DS benchmarks.

ByteDance Cloud Native

Cloud Shuffle Service Introduction

Pull‑Based Sort Shuffle is a common shuffle scheme used by Spark, FlinkBatch and MapReduce, but its implementation has defects that often cause job instability in large‑scale production environments. To address these issues, ByteDance developed Cloud Shuffle Service (CSS), a remote shuffle framework that offers better stability, higher performance, and greater elasticity, and also provides remote shuffle for compute‑storage separation scenarios.

On August 25, ByteDance announced the open‑source release of Cloud Shuffle Service on GitHub.

CSS is a generic Remote Shuffle Service framework supporting Spark (2.x & 3.x), FlinkBatch and MapReduce. It delivers more stable, higher‑performance, and elastic data shuffle capabilities compared with native solutions, and enables remote shuffle for compute‑storage separation and offline mixed deployments.

Problems with Pull‑Based Sort Shuffle

Merging multiple spill files into one incurs additional read/write I/O.

With m MapTasks and n ReduceTasks, m×n network connections can cause severe network congestion.

The node‑level Shuffle Service cannot isolate resources between applications; a single abnormal job may affect all jobs running on the same node.

Shuffle data files are stored locally on a single disk; disk failures lead to data loss and FetchFailed errors.

Reliance on local disks prevents compute‑storage separation.

These issues often result in slow or timed‑out ShuffleRead, FetchFailed errors, low CPU and memory utilization, and increased resource waste due to task re‑execution, especially in offline mixed or serverless cloud‑native scenarios.
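The m×n connection blow‑up above is easy to quantify. A toy calculation (illustrative arithmetic only, not CSS code):

```python
def pull_shuffle_connections(map_tasks: int, reduce_tasks: int) -> int:
    """In pull-based shuffle, every ReduceTask fetches from every
    MapTask's output, so connections grow multiplicatively."""
    return map_tasks * reduce_tasks

# A mid-sized job reaches tens of millions of connections.
print(pull_shuffle_connections(5000, 5000))  # 25000000
```

Push‑based shuffle avoids this by having each partition's data consolidated on a small, fixed set of workers, so a ReduceTask opens only a handful of connections.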

ByteDance runs millions of Spark jobs daily, shuffling over 300 PB of data per day. In HDFS‑mixed and offline‑mixed deployment environments, Spark job stability frequently falls short, affecting business SLAs.

CSS Architecture

CSS Cluster is an independently deployed shuffle service consisting of CSS Workers, a CSS Master, and ZooKeeper. Workers register themselves in ZooKeeper and provide two services:

Push: MapTasks push partition data to a designated CSS Worker, which stores all data for the same partition in a single file.

Fetch: ReduceTasks fetch partition data sequentially from the corresponding CSS Worker, greatly improving I/O efficiency compared with the random reads of the traditional External Shuffle Service (ESS).

When a job starts, a CSS Master is launched in the Spark Driver. The master obtains the list of CSS Workers from ZooKeeper, assigns n replicas (default 2) of each partition to workers, and manages the metadata. ReduceTasks use this metadata to locate the appropriate workers. RegisterShuffle and UnregisterShuffle operations create and delete Znodes in ZooKeeper, and workers watch these events to clean up shuffle data.
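The master's per‑partition replica assignment can be pictured with a simplified round‑robin sketch. Function names and the round‑robin policy here are illustrative assumptions; the real CSS Master also considers worker health and load:

```python
import itertools

def assign_replicas(partitions, workers, replicas=2):
    """Assign `replicas` CSS Workers to each partition, round-robin.
    A simplified stand-in for the CSS Master's allocation logic
    (hypothetical; the real allocator is load-aware)."""
    ring = itertools.cycle(workers)
    return {p: [next(ring) for _ in range(replicas)] for p in partitions}

assignment = assign_replicas(range(4), ["w1", "w2", "w3"])
print(assignment)
# {0: ['w1', 'w2'], 1: ['w3', 'w1'], 2: ['w2', 'w3'], 3: ['w1', 'w2']}
```

The resulting map is exactly the metadata ReduceTasks need: given a partition id, they know which workers hold its data and can fall back to the second replica if the first is unavailable.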

Key Features of CSS

Multi‑engine support: Besides Spark, CSS also integrates with MapReduce and FlinkBatch.

PartitionGroup: Consecutive small partitions are combined into larger PartitionGroups for more efficient Push.
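The PartitionGroup idea can be sketched as greedy grouping of consecutive partitions up to a size threshold. The function name and threshold are illustrative assumptions, not the actual CSS implementation:

```python
def group_partitions(sizes, target_bytes):
    """Greedily combine consecutive partitions into PartitionGroups of
    roughly target_bytes each, so many tiny pushes become fewer large
    ones (simplified sketch; CSS's real grouping policy may differ)."""
    groups, current, current_size = [], [], 0
    for pid, size in enumerate(sizes):
        current.append(pid)
        current_size += size
        if current_size >= target_bytes:
            groups.append(current)
            current, current_size = [], 0
    if current:  # flush the trailing under-full group
        groups.append(current)
    return groups

print(group_partitions([1, 2, 1, 5, 1, 1], target_bytes=4))
# [[0, 1, 2], [3], [4, 5]]
```

Grouping only consecutive partitions keeps each group's data contiguous by PartitionId, so a single Push can carry one group without reordering.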

Unified memory management: The CSS Buffer stores data for all partitions together, sorts it by PartitionId before spilling, and is fully managed by Spark’s UnifiedMemoryManager.

Fault tolerance:

Push failure: each Push batch is 4 MB; if a batch fails, CSS allocates a new worker and continues pushing, without affecting data that has already been pushed.

Multi‑replica storage: ReduceTasks can read from alternative replicas when a worker fails.

Data deduplication: Header information (MapId, AttemptId, BatchId) allows ReduceTasks to deduplicate data when speculative execution creates multiple attempts.
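The header‑based deduplication can be sketched as follows. This is a simplified model: treating `winning_attempt` as a map from MapId to the attempt the driver committed is an assumption about how the filter is parameterized:

```python
def dedup_batches(batches, winning_attempt):
    """Drop duplicate shuffle data created by speculative execution and
    push retries, using each batch's (map_id, attempt_id, batch_id)
    header (simplified sketch of CSS's reader-side dedup)."""
    seen, out = set(), []
    for (map_id, attempt_id, batch_id), payload in batches:
        # Keep only the attempt the driver committed for this MapTask.
        if attempt_id != winning_attempt[map_id]:
            continue
        # Drop batches re-pushed after a worker failover.
        key = (map_id, batch_id)
        if key in seen:
            continue
        seen.add(key)
        out.append(payload)
    return out

batches = [
    ((1, 0, 0), "a"),  # map 1, attempt 0, batch 0
    ((1, 1, 0), "a"),  # same data from speculative attempt 1
    ((1, 0, 0), "a"),  # batch 0 re-pushed after a failover
    ((1, 0, 1), "b"),
]
print(dedup_batches(batches, winning_attempt={1: 0}))  # ['a', 'b']
```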

AQE adaptation: CSS fully supports Adaptive Query Execution (AQE), including dynamic adjustment of the Reduce number, SkewJoin optimization, and join strategy selection, significantly improving performance for skewed joins.

Performance Evaluation

In a 1 TB TPC‑DS benchmark using exclusive label resources, CSS achieved an overall end‑to‑end performance improvement of about 15%, with some queries improving up to 30% compared with the open‑source ESS.

When the same benchmark was run on an online mixed‑resource queue (where ESS stability is poor), CSS delivered roughly a 4× speedup.

Future Plans

CSS has open‑sourced part of its features; additional features and optimizations will be released gradually:

Open‑source release of support for the MapReduce and FlinkBatch engines.

Introduce a ClusterManager role to manage worker status and load, and delegate worker assignment from the master to the manager.

Implement heterogeneous worker allocation strategies based on machine capabilities and current load.


Tags: distributed systems, Performance Optimization, big data, Spark, Remote Shuffle, Shuffle Service
Written by ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.