
Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's large‑scale evolution of Spark Shuffle to a cloud‑native architecture, describing background, stability and mixed‑resource scenarios, challenges such as CPU and I/O limits, custom ESS enhancements, shuffle throttling, spill‑split mechanisms, and the Cloud Shuffle Service with its push‑based design and performance gains.

DataFunSummit

Background

ByteDance runs Spark at massive scale, processing more than 150 million tasks per day and generating over 100 PB of shuffle data, with individual jobs sometimes shuffling hundreds of TB. This volume, combined with diverse storage media (SSD, HDD, and hybrid), creates significant shuffle performance challenges.

Principles

In the community ESS (External Shuffle Service) mode, Spark shuffles data from M map tasks to R reduce tasks in two phases: a Shuffle Write phase, in which each mapper partitions its output and writes it to local disk, and a Shuffle Read phase, in which every reducer fetches its partitions from all ESS nodes. This produces up to M × R network connections and intensive disk I/O.
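The scaling problem above can be made concrete with a minimal sketch in plain Python (not Spark's actual implementation, which lives in its shuffle writer and block fetcher classes):

```python
# Hypothetical sketch of ESS-style shuffle bookkeeping, for illustration only.

def partition_for(key, num_reducers):
    """Hash-partition a record key to one of R reducer partitions."""
    return hash(key) % num_reducers

def fetch_connections(num_mappers, num_reducers):
    """Every reducer fetches its partition from every mapper's output,
    so the shuffle-read phase opens up to M * R logical connections."""
    return num_mappers * num_reducers

# 10,000 mappers feeding 5,000 reducers already means 50 million fetches.
print(fetch_connections(10_000, 5_000))
```

This quadratic fan-in of small random reads is what makes ESS disk I/O so expensive at ByteDance's scale, and it motivates the push-based aggregation that CSS adopts later in the article.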

Challenges in Cloud‑Native Migration

Moving ESS from the YARN NodeManager to a Kubernetes DaemonSet imposed strict CPU limits on the service, requiring continuous resource tuning. Pod memory limits also reduced page‑cache effectiveness, lowering cache hit rates and increasing I/O overhead. To mitigate these issues, ByteDance introduced CPU‑share mechanisms, adjusted pod resource policies, and relaxed page‑cache restrictions.

Custom ESS Enhancements

To improve stability, ByteDance added detailed shuffle monitoring (queued chunks, chunk fetch rate) integrated with its internal metrics system, plus UI features that surface slow stages and the top offending tasks. Governance runs through the BatchBrain system, which collects event logs, timeline events, and custom shuffle metrics for both offline and real‑time analysis, enabling automated parameter tuning and alerting.
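As a rough illustration of the kind of per-node metrics named above (queued chunks and chunk fetch rate), here is a hypothetical rolling-window collector; the class and method names are inventions for this sketch, not ByteDance's actual metrics API:

```python
import time
from collections import deque

class ShuffleMetrics:
    """Hypothetical per-ESS-node tracker for queued chunks and chunk fetch rate."""

    def __init__(self, window_secs=60):
        self.window_secs = window_secs
        self.events = deque()   # (timestamp, bytes) for each served chunk
        self.queued_chunks = 0  # chunks currently waiting on the fetch queue

    def on_chunk_queued(self):
        self.queued_chunks += 1

    def on_chunk_served(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        self.queued_chunks = max(0, self.queued_chunks - 1)
        self.events.append((now, nbytes))
        # Drop events that have aged out of the rolling window.
        while self.events and now - self.events[0][0] > self.window_secs:
            self.events.popleft()

    def fetch_rate(self, now=None):
        """Chunks served per second over the rolling window."""
        now = time.monotonic() if now is None else now
        recent = [t for t, _ in self.events if now - t <= self.window_secs]
        return len(recent) / self.window_secs
```

A queue that keeps growing while the fetch rate stalls is exactly the signal a system like BatchBrain can alert on or feed into automated parameter tuning.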

Shuffle Throttling

A throttling mechanism caps the total number of fetch requests each application may issue to an ESS node. When fetch latency exceeds a threshold, the node evaluates its running applications, allocates flow by priority, and rejects excess fetches until load normalizes, preventing any single application from monopolizing resources.
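The priority-based flow allocation described above might look like the following sketch. The functions and the proportional-split policy are assumptions for illustration; the article does not specify the exact allocation formula:

```python
def allocate_fetch_quota(apps, total_quota):
    """Hypothetical priority-weighted split: when an ESS node's fetch latency
    crosses its threshold, divide the node's allowed in-flight fetch requests
    across running applications in proportion to their priority."""
    total_priority = sum(priority for _, priority in apps)
    return {app: total_quota * priority // total_priority for app, priority in apps}

def admit_fetch(app, in_flight, quota):
    """Reject fetches beyond an application's quota until load normalizes."""
    return in_flight.get(app, 0) < quota.get(app, 0)

# A high-priority ETL job and a low-priority ad-hoc query share one node.
apps = [("etl_job", 3), ("adhoc_query", 1)]
quota = allocate_fetch_quota(apps, total_quota=400)
```

With the 3:1 priorities above, the ETL job receives 300 of the 400 permitted fetches and the ad-hoc query 100, so neither can monopolize the node.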

Spill‑Split Functionality

To address data skew, where a few executors write a disproportionate share of shuffle data, ByteDance limits per‑executor shuffle write size, excludes overloaded executors from further scheduling, and uses the Godel scheduler to balance executor placement, reducing node overload and fetch failures.
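The exclusion step can be sketched as a simple threshold check; the function name and the 100 GiB limit are hypothetical, and real placement decisions belong to the Godel scheduler:

```python
def executors_to_exclude(shuffle_written, limit_bytes):
    """Hypothetical skew guard: flag executors whose cumulative shuffle-write
    volume exceeds a per-executor limit, so the scheduler stops placing new
    tasks on the nodes they would overload."""
    return {ex for ex, written in shuffle_written.items() if written > limit_bytes}

# One skewed executor has written 900 GiB of shuffle data; its peers are healthy.
written = {"exec-1": 900 * 2**30, "exec-2": 40 * 2**30, "exec-3": 35 * 2**30}
skewed = executors_to_exclude(written, limit_bytes=100 * 2**30)
```

Here only exec-1 crosses the limit and is excluded, which is the behavior that prevents a single skewed executor from triggering fetch failures on its node.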

Cloud Shuffle Service (CSS)

For mixed‑resource clusters, CSS provides a remote, push‑based shuffle service with partition grouping, dual‑write in‑memory replication, and load‑balanced workers coordinated by a Cluster Manager. The Shuffle Client integrates with Spark and offers asynchronous writes, automatic failover, and deduplication, improving both shuffle latency and throughput.
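A rough sketch of partition grouping with dual-write replication follows. The hash-based worker selection stands in for the Cluster Manager's real load-balancing, and all names here are illustrative, not CSS's actual API:

```python
import hashlib

def assign_partition_groups(num_partitions, workers, group_size=2, replicas=2):
    """Hypothetical CSS-style assignment: bucket partitions into groups, and
    push each group to `replicas` workers so losing one worker does not lose
    the in-memory shuffle data (the dual-write described above)."""
    num_groups = (num_partitions + group_size - 1) // group_size
    assignment = {}
    for group in range(num_groups):
        # Deterministic stand-in for load-balanced worker selection; a real
        # Cluster Manager would pick workers from live load statistics.
        start = int(hashlib.md5(str(group).encode()).hexdigest(), 16) % len(workers)
        assignment[group] = [workers[(start + r) % len(workers)] for r in range(replicas)]
    return assignment

groups = assign_partition_groups(8, ["w1", "w2", "w3", "w4"])
```

Because mappers push whole groups to a small, fixed set of workers instead of every reducer pulling from every mapper, the M × R fan-in of ESS collapses into sequential, aggregated writes.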

Performance and Future Evolution

In TPC‑DS benchmarks, CSS delivered more than a 30% query performance improvement. The service is open source, supports elastic deployment, and will continue to evolve with cloud‑native capabilities such as remote storage integration.

Conclusion

The comprehensive cloud‑native transformation of Spark shuffle at ByteDance, spanning monitoring, throttling, spill‑split, and the CSS architecture, significantly improves stability and performance for both stable and mixed‑resource workloads.

Cloud Native · Performance Optimization · Big Data · Kubernetes · Spark · Shuffle
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.