Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits
This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Apache Celeborn. It covers the background problems, an evaluation of open‑source RSS implementations, the deployment architecture, issues encountered during migration and their solutions, the resulting performance gains, and future plans.
Zhihu runs a large Hadoop cluster where daily Spark jobs generate over 3 PB of shuffle data, with some stages reaching 100 TB. The existing External Shuffle Service (ESS) provides stability but suffers from massive random I/O and network connections, leading to disk IOPS bottlenecks and unstable job performance.
To address these limitations, Zhihu evaluated push‑based shuffle solutions, focusing on two open‑source projects: Apache Uniffle (originally from Tencent) and Apache Celeborn (originally from Alibaba). Benchmarking with the TPC‑DS benchmark at scale factor 3000 showed that both RSS implementations outperform ESS, with Celeborn also consuming less memory.
After testing, Zhihu chose Celeborn for production deployment. The architecture places a Celeborn Worker on each Hadoop node, isolates its disks from ESS, and performs a phased gray‑scale migration of Spark jobs, adding necessary Celeborn parameters via an automated Spark‑job‑parameter‑optimization service.
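Concretely, the parameters such a service injects would look roughly like the following spark‑defaults fragment. The endpoint addresses are placeholders and the shuffle‑manager class name follows recent Celeborn releases; consult the Celeborn Spark client documentation for the exact keys matching your version:

```properties
# Route shuffle through Celeborn instead of ESS (class name per recent Celeborn releases)
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
# Placeholder addresses for the Celeborn Master endpoints
spark.celeborn.master.endpoints=celeborn-master-1:9097,celeborn-master-2:9097
# ESS is no longer needed for jobs whose shuffle data lives on Celeborn Workers
spark.shuffle.service.enabled=false
```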
During migration, several issues were encountered and resolved:
OutOfMemoryError ("unable to create new native thread") in Kyuubi‑managed Spark drivers, whose threads collectively exceeded Systemd's per‑service task limit; the limit was raised by adjusting Systemd's DefaultTasksMax configuration.
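Systemd counts every thread of a service against its task limit, so a host running many Kyuubi‑launched drivers under one service can exhaust the default cap. A sketch of the fix (the value 65535 is illustrative, not the value Zhihu used):

```ini
# /etc/systemd/system.conf -- raise the default per-unit task (thread) cap
[Manager]
DefaultTasksMax=65535
```

After editing, `systemctl daemon-reexec` (or a reboot) applies the new default; a per‑unit TasksMax= override in the specific service file is a narrower alternative.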
Spark jobs ignored Celeborn parameters due to the spark.executor.userClassPathFirst=true setting, which caused class‑loader mismatches; removing this flag restored proper shuffle selection.
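The mismatch arises because userClassPathFirst makes executors load classes in a child‑first class loader, so the Celeborn shuffle manager on the system classpath is not resolved consistently. Since the flag defaults to false, removing the explicit setting is equivalent to:

```properties
# Default is false; Zhihu removed the explicit true setting, which has the same effect
spark.executor.userClassPathFirst=false
```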
GlobalLimit operators generated huge shuffle writes to a single RSS node; a rule inserting a LocalLimit before GlobalLimit reduced data volume and prevented node exclusion.
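The intuition behind the rule can be illustrated outside of Catalyst. In the sketch below (plain Python, not the actual Spark rule), applying a per‑partition limit before the single‑reducer global limit leaves the result unchanged while shrinking the shuffled volume from all rows to at most n rows per map task:

```python
def global_limit_naive(partitions, n):
    # Every row is shuffled to a single node, then truncated there.
    merged = [row for part in partitions for row in part]
    return merged[:n], len(merged)  # (result, rows shuffled)

def global_limit_with_local(partitions, n):
    # Each map task keeps at most n rows before the shuffle.
    limited = [part[:n] for part in partitions]
    merged = [row for part in limited for row in part]
    return merged[:n], len(merged)

parts = [list(range(p * 1000, p * 1000 + 1000)) for p in range(8)]
res_naive, shuffled_naive = global_limit_naive(parts, 10)
res_local, shuffled_local = global_limit_with_local(parts, 10)
assert res_naive == res_local        # same answer either way
print(shuffled_naive, shuffled_local)  # 8000 vs 80 rows crossing the network
```

With 8 partitions of 1,000 rows and a limit of 10, the single RSS node receives 80 rows instead of 8,000, which is why the rewrite prevented the node from being overloaded and excluded.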
High node load caused shuffle timeouts; timeouts were increased, buffer sizes enlarged, and disk‑load‑aware slot scheduling was introduced.
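Disk‑load‑aware slot scheduling can be sketched as weighting candidate Workers inversely to their current disk load, so busy disks receive proportionally fewer shuffle‑partition slots. The function below is an illustrative model (names and the weighting formula are assumptions), not Celeborn's actual assignment policy:

```python
import random

def assign_slots(workers, num_slots, seed=None):
    """Assign shuffle-partition slots, favoring lightly loaded Workers.

    `workers` maps worker id -> current disk load in [0, 1).
    The weight (1 - load) is illustrative: a worker at 10% load is
    ~9x more likely to receive a slot than one at 90% load.
    """
    rng = random.Random(seed)
    ids = list(workers)
    weights = [1.0 - workers[w] for w in ids]
    return rng.choices(ids, weights=weights, k=num_slots)

loads = {"worker-a": 0.10, "worker-b": 0.50, "worker-c": 0.90}
slots = assign_slots(loads, 1000, seed=42)
counts = {w: slots.count(w) for w in loads}
print(counts)  # worker-a gets the most slots, worker-c the fewest
```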
Worker connection spikes were mitigated by capping the maximum number of Workers assigned per job at 500, configured as celeborn.master.slot.assign.maxWorkers=500.
Post‑migration results show RSS consumes only one‑sixth of the cluster’s disk resources while handling over one‑third of shuffle traffic. Job latency improved by more than 30 % at the 99th percentile, and overall CPU and memory usage decreased despite continuous addition of new jobs.
Future work includes extending Celeborn to Spark SQL ETL jobs, supporting MapReduce workloads, fully decommissioning ESS to free NodeManager memory, and enhancing Celeborn’s compatibility with high‑load nodes.
References: "Spark+Celeborn: Faster, More Stable, More Elastic" and "Magnet: Push‑based Shuffle Service for Large‑scale Data Processing".
DataFunTalk