
Apache Celeborn (Incubating): Addressing Traditional Shuffle Limitations in Big Data Processing

Apache Celeborn (Incubating) is a remote shuffle service designed to overcome the inefficiencies, high storage demands, network overhead, and limited fault tolerance of traditional Spark shuffle implementations. It does so by introducing push‑shuffle, partition splitting, columnar shuffle, multi‑layer storage, and an elastic, stable, and scalable architecture.

DataFunTalk

Traditional Shuffle Problems

Traditional Spark shuffle writes sorted partition data to local or cloud disks, causing five major drawbacks: reliance on large local storage, high memory usage for sorting, O(m×n) network connections (m mappers × n reducers), excessive random disk reads, and low fault tolerance due to single‑replica data.

Apache Celeborn (Incubating)

Celeborn is an incubating Apache project that provides an engine‑agnostic remote shuffle service. It was originally developed by the Alibaba Cloud EMR Spark team to eliminate the need for large local disks and to support spilled‑data handling in the future.

It originated in 2020 as the Remote Shuffle Service (RSS), was open‑sourced in December 2021, and entered Apache incubation in October 2022; the project now has over 600 commits, 32 contributors, and 330+ stars.

Celeborn’s Performance, Stability, and Elasticity

Performance

Celeborn uses push‑shuffle with partition‑level aggregation: each mapper buffers data and pushes each partition to its pre‑assigned worker, avoiding local disk writes and sort‑based spilling, converting reducers' random reads into sequential reads, and reducing network connections from quadratic (m×n) to linear.
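The mechanism above can be sketched in a few lines. This is an illustrative toy, not Celeborn's API: the worker names, the partitioner, and the in‑memory buffers are all assumptions made for the example. The key idea is that each partition has one owner worker that aggregates pushes from all mappers, so a reducer later performs a single sequential read.

```python
# Illustrative sketch of push-shuffle with partition-level aggregation.
# All names (workers, buffers, partitioner) are hypothetical, not Celeborn's.
from collections import defaultdict

NUM_PARTITIONS = 4
WORKERS = ["worker-0", "worker-1"]

# Each partition is pre-assigned to exactly one owner worker.
partition_owner = {p: WORKERS[p % len(WORKERS)] for p in range(NUM_PARTITIONS)}

# Each worker keeps one append-only buffer per partition it owns.
worker_buffers = {w: defaultdict(list) for w in WORKERS}

def partitioner(key: str) -> int:
    # Deterministic toy partitioner (Python's built-in hash is randomized).
    return ord(key[0]) % NUM_PARTITIONS

def mapper_push(records):
    """Group records by partition locally, then push each group to its owner."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[partitioner(key)].append((key, value))
    for pid, batch in grouped.items():
        worker_buffers[partition_owner[pid]][pid].extend(batch)

def reducer_read(pid):
    """One sequential read of the already-aggregated partition."""
    return worker_buffers[partition_owner[pid]][pid]

# Two mappers push; connections scale with mappers x workers, not mappers x reducers.
mapper_push([("a", 1), ("b", 2), ("c", 3)])
mapper_push([("a", 4), ("d", 5)])
```

Because aggregation happens at the worker, all records for key `"a"` land in one contiguous buffer regardless of which mapper produced them.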

Partition splitting prevents oversized files: when a partition's current file exceeds a size threshold, Celeborn seals it and allocates a new split (possibly on a different worker), keeping file sizes bounded and balanced.
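The split‑on‑threshold behavior can be sketched as follows. The class name, the threshold value, and the byte buffers are assumptions for illustration only, not Celeborn internals:

```python
# Sketch of partition splitting: when the active split grows past a size
# threshold, the writer seals it and opens a new one, bounding every file.
# Names and the tiny threshold are hypothetical, chosen for illustration.
SPLIT_THRESHOLD = 100  # bytes; real thresholds would be hundreds of MB

class PartitionWriter:
    def __init__(self):
        self.splits = [bytearray()]  # sealed splits plus one active split

    def append(self, data: bytes):
        if len(self.splits[-1]) + len(data) > SPLIT_THRESHOLD:
            self.splits.append(bytearray())  # seal current split, open a new one
        self.splits[-1].extend(data)

w = PartitionWriter()
for _ in range(5):
    w.append(b"x" * 40)  # 200 bytes total, spread over bounded splits
```

Every split stays at or below the threshold, so no single file grows without bound even for heavily skewed partitions.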

Columnar shuffle converts row‑oriented data to columnar format during write and back during read, leveraging code generation to keep conversion overhead below 5% and reducing shuffle size by ~40% in TPC‑DS 3T tests.
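At its core, the row‑to‑columnar conversion is a lossless transpose of the data layout; Celeborn's real implementation uses code generation to keep this cheap, but the shape of the transformation can be sketched directly (the function names here are hypothetical):

```python
# Sketch of row<->columnar conversion on the shuffle path (illustrative only).
# Columnar layout groups same-typed values together, which compresses better
# and is what drives the reported shuffle-size reduction.
def rows_to_columns(rows):
    """[(1, 'a'), (2, 'b')] -> [(1, 2), ('a', 'b')]"""
    return list(zip(*rows))

def columns_to_rows(cols):
    """Inverse conversion, applied at shuffle read time."""
    return list(zip(*cols))

rows = [(1, "spark"), (2, "flink"), (3, "mr")]
cols = rows_to_columns(rows)
```

The round trip is exact, so correctness is unaffected; only the on‑wire and on‑disk layout changes.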

Multi‑layer storage supports memory, local disk, and distributed storage (OSS/HDFS); data is flushed from push region to appropriate layers, with eviction policies moving data between layers to maintain performance and elasticity.
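A minimal sketch of the cascading placement described above, assuming a simple oldest‑first eviction policy (the layer names, capacities, and policy are assumptions for illustration; Celeborn's actual eviction logic is configurable and more sophisticated):

```python
# Sketch of multi-layer storage with cascading eviction (assumed policy):
# data lands in memory first; when a layer is over capacity, its oldest
# chunks are evicted to the next layer down (disk, then distributed storage).
LAYERS = ["memory", "disk", "hdfs"]
CAPACITY = {"memory": 2, "disk": 3, "hdfs": float("inf")}
store = {layer: [] for layer in LAYERS}

def write(chunk):
    store["memory"].append(chunk)
    # Cascade evictions oldest-first while any layer is over capacity.
    for upper, lower in zip(LAYERS, LAYERS[1:]):
        while len(store[upper]) > CAPACITY[upper]:
            store[lower].append(store[upper].pop(0))

for i in range(7):
    write(f"chunk-{i}")
```

Hot, recent data stays in the fastest layer while cold data drains toward distributed storage, which is what lets workers stay small and elastic.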

Stability

In‑place upgrades are achieved via forward‑compatible protocols and graceful restarts, which use hard‑split markers and persist worker state in LevelDB.

Congestion control mimics TCP with slow start, congestion avoidance, and back‑off, while also supporting credit‑based flow control for Flink shuffle reads.
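The TCP‑like window dynamics can be sketched as a single transition function; the constants and variable names here are assumptions, not Celeborn's implementation:

```python
# Sketch of TCP-style congestion control: exponential growth during slow
# start, linear growth in congestion avoidance, multiplicative back-off
# when the receiver signals congestion. Constants are illustrative.
SSTHRESH = 16

def next_window(window, ssthresh, congested):
    if congested:
        ssthresh = max(window // 2, 1)  # remember half the window at back-off
        return 1, ssthresh              # restart from slow start
    if window < ssthresh:
        return window * 2, ssthresh     # slow start: exponential growth
    return window + 1, ssthresh         # congestion avoidance: linear growth

window, ssthresh = 1, SSTHRESH
history = []
for congested in [False, False, False, False, False, True, False, False]:
    window, ssthresh = next_window(window, ssthresh, congested)
    history.append(window)
```

The window ramps up quickly, probes linearly near capacity, collapses on a congestion signal, and then re‑grows toward the new, lower threshold.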

Load balancing monitors disk health, write speed, and capacity, assigning work to healthier, faster disks based on a grouping algorithm.
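A hedged sketch of that selection logic, with made‑up metric names and thresholds (Celeborn's actual grouping algorithm uses its own metrics and rotates within groups):

```python
# Sketch of health-based disk selection (assumed metrics and thresholds):
# unhealthy or nearly-full disks are filtered out, then the fastest
# remaining disk is preferred for new shuffle writes.
disks = [
    {"name": "d0", "healthy": True,  "write_mb_s": 180, "free_gb": 500},
    {"name": "d1", "healthy": True,  "write_mb_s": 60,  "free_gb": 50},
    {"name": "d2", "healthy": False, "write_mb_s": 0,   "free_gb": 900},
    {"name": "d3", "healthy": True,  "write_mb_s": 200, "free_gb": 800},
]

def pick_disk(disks, min_free_gb=100):
    candidates = [d for d in disks if d["healthy"] and d["free_gb"] > min_free_gb]
    # Fastest eligible disk wins; a real scheduler would round-robin
    # within groups of similar disks to avoid hot-spotting one device.
    return max(candidates, key=lambda d: d["write_mb_s"])["name"]
```

The failed disk (`d2`) and the nearly full one (`d1`) are skipped, steering writes to the healthiest, fastest hardware.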

Elasticity

When integrated with Spark on Kubernetes, Celeborn enables true elastic resource release by offloading shuffle data, allowing idle pods to terminate immediately, unlike the external shuffle service which cannot fully release resources.

Typical Deployment Scenarios

Fully co‑located deployment with HDFS, YARN, and Celeborn for maximum performance and stability.

Independent Celeborn deployment with HDFS/YARN co‑located, providing resource isolation and partial elasticity.

Compute‑storage separation where data resides in object storage, compute runs on K8s or YARN, and Celeborn runs independently, delivering full elasticity.

Evaluation

In mixed‑deployment with over 1,000 nodes, Celeborn handled compressed shuffle volumes of 4 PB, supporting >80 k concurrent tasks and 16 TB jobs on HDDs with high stability.

In a storage‑compute separation case, thousands of Spark pods on K8s achieved better elasticity, performance, and stability.

TPC‑DS 3T benchmarks show Celeborn improves single‑replica shuffle performance by 20% and dual‑replica by 13% without extra resource consumption.

Overall, Celeborn addresses the inefficiencies of traditional shuffle, offering a high‑performance, stable, and elastic remote shuffle solution for big‑data workloads.

Tags: performance optimization, Big Data, Apache Spark, Celeborn, Remote Shuffle, Shuffle Service
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
