
Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's large-scale migration of Spark Shuffle to a cloud-native architecture: the massive data volumes involved, the underlying ESS and CSS services, the challenges of resource isolation and monitoring, the throttling and spill-splitting mechanisms introduced, and the performance gains achieved across stable and mixed-resource clusters.

DataFunTalk

Background: ByteDance runs Spark at massive scale, processing over 150 million daily tasks and generating more than 100 PB of shuffle data per day, with individual shuffle volumes reaching hundreds of terabytes, creating significant performance challenges.

Evolution: Starting in early 2021, ByteDance began the cloud-native migration of Spark Shuffle, moving from YARN auxiliary services to a Kubernetes-based Godel scheduler and redesigning the ESS (External Shuffle Service) as a DaemonSet.

Challenges: Strict CPU limits on DaemonSets required continuous resource tuning. Pod memory limits prevented effective page-cache usage, increasing shuffle read latency. Mixed-resource clusters suffered from high fetch-failure rates due to limited I/O capacity.

Solutions for Stable Clusters: Enhanced ESS monitoring and governance, adding metrics such as Queued Chunks and Chunk Fetch Rate, integrated into ByteDance's metrics system. Implemented shuffle throttling per application to cap fetch requests when node latency exceeds thresholds, with priority-aware flow control. Introduced shuffle spill-splitting to limit per-executor write size and balance data distribution using Godel scheduling.
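The per-application throttling described above can be sketched as follows. This is an illustrative model, not ByteDance's actual implementation: the class name, thresholds, and the priority-scaling formula are all assumptions chosen to show the idea of capping in-flight fetch requests per application once observed node latency crosses a threshold, with higher-priority applications keeping a larger share.

```python
from collections import defaultdict

class ShuffleThrottler:
    """Hypothetical sketch of priority-aware shuffle throttling:
    when the node's observed chunk-fetch latency exceeds a threshold,
    each application's in-flight fetch cap shrinks, scaled by its
    priority (all names and constants are illustrative)."""

    def __init__(self, latency_threshold_ms=500, base_cap=64):
        self.latency_threshold_ms = latency_threshold_ms
        self.base_cap = base_cap
        self.in_flight = defaultdict(int)   # app_id -> outstanding fetches
        self.node_latency_ms = 0.0          # smoothed latency on this node

    def record_latency(self, latency_ms, alpha=0.2):
        # Exponential moving average of observed chunk-fetch latency.
        self.node_latency_ms = alpha * latency_ms + (1 - alpha) * self.node_latency_ms

    def cap_for(self, priority):
        # Priority-aware flow control: under normal latency every app gets
        # the full cap; under overload, the cap scales with priority (1-10).
        if self.node_latency_ms <= self.latency_threshold_ms:
            return self.base_cap
        return max(1, self.base_cap * priority // 10)

    def try_fetch(self, app_id, priority):
        # Admit a fetch request only if the app is under its current cap.
        if self.in_flight[app_id] >= self.cap_for(priority):
            return False
        self.in_flight[app_id] += 1
        return True

    def finish_fetch(self, app_id):
        self.in_flight[app_id] = max(0, self.in_flight[app_id] - 1)
```

The key design point is that the cap is recomputed on every admission decision, so flow control reacts as soon as the smoothed latency crosses the threshold, and low-priority applications are squeezed first.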

Mixed-Resource Cluster Optimizations: Developed Cloud Shuffle Service (CSS) offering a push-based shuffle model, partition grouping, and dual-write in-memory replication for high availability. CSS includes a cluster manager for load-balanced worker allocation and a shuffle client that handles automatic failover and deduplication.
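The dual-write and deduplication behavior of the CSS client can be sketched like this. This is not ByteDance's actual CSS API; the class names, the `batch_id` scheme, and the in-memory worker are illustrative assumptions showing how a push can survive one replica failure and why readers must deduplicate batches that were pushed more than once.

```python
class InMemoryWorker:
    """Toy stand-in for a CSS shuffle worker (illustrative only)."""
    def __init__(self, fail=False):
        self.fail = fail
        self.store = {}  # partition -> list of (batch_id, data)

    def append(self, partition, batch_id, data):
        if self.fail:
            raise IOError("worker unavailable")
        self.store.setdefault(partition, []).append((batch_id, data))

    def batches(self, partition):
        return self.store.get(partition, [])

class CSSClientSketch:
    """Sketch of the client behavior described above: every push is
    written to two replica workers (dual-write), and the reader drops
    duplicate batch ids so retried pushes after failover are safe."""

    def __init__(self, primary, backup):
        self.replicas = [primary, backup]

    def push(self, partition, batch_id, data):
        # Dual-write: succeed if at least one replica accepts the batch;
        # a retried batch keeps its batch_id so readers can deduplicate.
        ok = 0
        for worker in self.replicas:
            try:
                worker.append(partition, batch_id, data)
                ok += 1
            except IOError:
                continue  # failover: the other replica still holds the data
        if ok == 0:
            raise IOError("both replicas failed")

    @staticmethod
    def read(partition, workers):
        # Merge replicas and drop duplicate batch ids (at-least-once pushes).
        seen, out = set(), []
        for worker in workers:
            for batch_id, data in worker.batches(partition):
                if batch_id not in seen:
                    seen.add(batch_id)
                    out.append(data)
        return out
```

Because pushes are at-least-once across two replicas, reads are deterministic only if deduplication keys every batch; that is why the client, not the worker, owns failover and dedup in this model.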

Results: After deployment, high-priority clusters saw shuffle fetch failure rates drop below 1 in 10,000, along with latency improvements and a query performance boost of more than 30% on TPC-DS benchmarks; CSS further reduced shuffle I/O pressure and improved reliability.

Conclusion: The cloud-native redesign of Spark Shuffle at ByteDance, combining ESS enhancements, throttling, spill-splitting, and the CSS remote service, significantly increased stability and performance for both stable and mixed-resource environments.

Tags: Cloud Native, Performance Optimization, Big Data, Spark, Shuffle, ByteDance
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
