Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits
This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Apache Celeborn. It covers the background problems, an evaluation of open‑source RSS implementations, the deployment architecture, issues encountered during migration and their solutions, the resulting performance gains, and future plans.
Zhihu runs a large Hadoop cluster where daily Spark jobs generate over 3 PB of shuffle data, with some stages reaching 100 TB. The existing External Shuffle Service (ESS) provides stability but suffers from massive random I/O and network connections, leading to disk IOPS bottlenecks and unstable job performance.
To address these limitations, Zhihu evaluated push‑based shuffle solutions, focusing on two open‑source projects: Apache Uniffle (originally from Tencent) and Apache Celeborn (originally from Alibaba). Benchmarking with the TPC‑DS benchmark at scale factor 3000 showed that both RSS implementations outperform ESS, with Celeborn also consuming less memory.
After testing, Zhihu chose Celeborn for production deployment. The architecture places a Celeborn Worker on each Hadoop node, isolates its disks from ESS, and performs a phased gray‑scale migration of Spark jobs, adding necessary Celeborn parameters via an automated Spark‑job‑parameter‑optimization service.
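Concretely, the parameters such a service injects would look roughly like the following spark‑defaults fragment. The endpoint addresses are placeholders and the shuffle‑manager class name follows recent Celeborn releases; consult the Celeborn Spark client documentation for the exact keys matching your version:

```properties
# Route shuffle through Celeborn instead of ESS (class name per recent Celeborn releases)
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
# Placeholder addresses for the Celeborn Master endpoints
spark.celeborn.master.endpoints=celeborn-master-1:9097,celeborn-master-2:9097
# ESS is no longer needed for jobs whose shuffle data lives on Celeborn Workers
spark.shuffle.service.enabled=false
```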
During migration, several issues were encountered and resolved:
OutOfMemoryError ("unable to create new native thread") in Kyuubi‑managed Spark drivers, whose threads collectively exceeded Systemd's per‑service task limit; the limit was raised by adjusting Systemd's DefaultTasksMax configuration.
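Systemd counts every thread of a service against its task limit, so a host running many Kyuubi‑launched drivers under one service can exhaust the default cap. A sketch of the fix (the value 65535 is illustrative, not the value Zhihu used):

```ini
# /etc/systemd/system.conf -- raise the default per-unit task (thread) cap
[Manager]
DefaultTasksMax=65535
```

After editing, `systemctl daemon-reexec` (or a reboot) applies the new default; a per‑unit TasksMax= override in the specific service file is a narrower alternative.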
Spark jobs ignored Celeborn parameters due to the spark.executor.userClassPathFirst=true setting, which caused class‑loader mismatches; removing this flag restored proper shuffle selection.
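The mismatch arises because userClassPathFirst makes executors load classes in a child‑first class loader, so the Celeborn shuffle manager on the system classpath is not resolved consistently. Since the flag defaults to false, removing the explicit setting is equivalent to:

```properties
# Default is false; Zhihu removed the explicit true setting, which has the same effect
spark.executor.userClassPathFirst=false
```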
GlobalLimit operators generated huge shuffle writes to a single RSS node; a rule inserting a LocalLimit before GlobalLimit reduced data volume and prevented node exclusion.
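The intuition behind the rule can be illustrated outside of Catalyst. In the sketch below (plain Python, not the actual Spark rule), applying a per‑partition limit before the single‑reducer global limit leaves the result unchanged while shrinking the shuffled volume from all rows to at most n rows per map task:

```python
def global_limit_naive(partitions, n):
    # Every row is shuffled to a single node, then truncated there.
    merged = [row for part in partitions for row in part]
    return merged[:n], len(merged)  # (result, rows shuffled)

def global_limit_with_local(partitions, n):
    # Each map task keeps at most n rows before the shuffle.
    limited = [part[:n] for part in partitions]
    merged = [row for part in limited for row in part]
    return merged[:n], len(merged)

parts = [list(range(p * 1000, p * 1000 + 1000)) for p in range(8)]
res_naive, shuffled_naive = global_limit_naive(parts, 10)
res_local, shuffled_local = global_limit_with_local(parts, 10)
assert res_naive == res_local        # same answer either way
print(shuffled_naive, shuffled_local)  # 8000 vs 80 rows crossing the network
```

With 8 partitions of 1,000 rows and a limit of 10, the single RSS node receives 80 rows instead of 8,000, which is why the rewrite prevented the node from being overloaded and excluded.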
High node load caused shuffle timeouts; timeouts were increased, buffer sizes enlarged, and disk‑load‑aware slot scheduling was introduced.
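Disk‑load‑aware slot scheduling can be sketched as weighting candidate Workers inversely to their current disk load, so busy disks receive proportionally fewer shuffle‑partition slots. The function below is an illustrative model (names and the weighting formula are assumptions), not Celeborn's actual assignment policy:

```python
import random

def assign_slots(workers, num_slots, seed=None):
    """Assign shuffle-partition slots, favoring lightly loaded Workers.

    `workers` maps worker id -> current disk load in [0, 1).
    The weight (1 - load) is illustrative: a worker at 10% load is
    ~9x more likely to receive a slot than one at 90% load.
    """
    rng = random.Random(seed)
    ids = list(workers)
    weights = [1.0 - workers[w] for w in ids]
    return rng.choices(ids, weights=weights, k=num_slots)

loads = {"worker-a": 0.10, "worker-b": 0.50, "worker-c": 0.90}
slots = assign_slots(loads, 1000, seed=42)
counts = {w: slots.count(w) for w in loads}
print(counts)  # worker-a gets the most slots, worker-c the fewest
```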
Worker connection spikes were mitigated by capping the maximum number of Workers assigned per job at 500, configured as celeborn.master.slot.assign.maxWorkers=500.
Post‑migration results show RSS consumes only one‑sixth of the cluster’s disk resources while handling over one‑third of shuffle traffic. Job latency improved by more than 30 % at the 99th percentile, and overall CPU and memory usage decreased despite continuous addition of new jobs.
Future work includes extending Celeborn to Spark SQL ETL jobs, supporting MapReduce workloads, fully decommissioning ESS to free NodeManager memory, and enhancing Celeborn’s compatibility with high‑load nodes.
References: "Spark+Celeborn: Faster, More Stable, More Elastic" and "Magnet: Push‑based Shuffle Service for Large‑scale Data Processing".
DataFunTalk