Tag

Shuffle Optimization


Data Thinking Notes
Oct 27, 2022 · Big Data

Boost Spark Performance: Proven Code Optimizations & Tuning Tips

This article outlines practical Spark job optimization techniques: code-level improvements, resource tuning, data skew handling, persistence strategies, shuffle reduction, broadcast variables, Kryo serialization, and efficient data structures. It shows how each of these can cut execution time.

Kryo Serialization · Performance Tuning · RDD Persistence
0 likes · 19 min read
DataFunTalk
Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.

Apache Spark · GPU acceleration · NVIDIA
0 likes · 19 min read
JD Tech
Feb 8, 2021 · Big Data

JD Remote Shuffle Service: Design, Implementation, and Performance Evaluation

This article presents JD's self‑developed Remote Shuffle Service for Spark, covering its goals, architecture, implementation, performance benchmarks, and real‑world production case studies that demonstrate its impact on shuffle efficiency and system stability in large‑scale data processing.

Remote Shuffle Service · Shuffle Optimization · Spark
0 likes · 17 min read
JD Retail Technology
Jan 19, 2021 · Big Data

Design, Implementation, and Performance Evaluation of JD's Remote Shuffle Service for Spark

This article describes JD's research and production deployment of a self‑developed Remote Shuffle Service for Spark, covering its motivations, architectural design, cloud‑native features, monitoring, performance benchmarks against external shuffle solutions, and a real‑world promotion‑period case study that demonstrates improved stability and resource efficiency.

Remote Shuffle Service · Shuffle Optimization · Spark
0 likes · 17 min read
DataFunTalk
Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares its optimizations for Spark SQL, covering an architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements. The article illustrates practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Materialized Columns · Shuffle Optimization · big data
0 likes · 20 min read
Liulishuo Tech Team
Jun 12, 2018 · Big Data

Highlights from Spark+AI Summit 2018: Hydrogen, MLflow, Delta, Spark 2.3, and Shuffle Optimization

The 2018 Spark+AI Summit in San Francisco showcased Spark's evolution toward unified AI and big‑data processing, introducing the Hydrogen project with gang scheduling, the open‑source MLflow platform, the Delta unified analytics engine, Spark 2.3 enhancements, and Facebook's shuffle I/O optimizations.

Delta Lake · Hydrogen · MLflow
0 likes · 8 min read
Architecture Digest
May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article provides a comprehensive guide on tackling Spark performance bottlenecks by diagnosing data skew, locating the offending stages and operators, and applying a range of practical solutions—including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies—followed by an in‑depth discussion of shuffle manager evolution and key configuration parameters for fine‑tuning.
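As a rough illustration of the two‑stage aggregation idea named in this summary, the sketch below simulates it in plain Python rather than Spark. The function name `two_stage_sum`, the `num_salts` parameter, and the sample records are hypothetical and not taken from the linked article. It assumes the usual formulation: add a random salt to each key so a hot key is split across several sub‑keys, aggregate once per salted key, then strip the salt and aggregate again.

```python
import random
from collections import defaultdict


def two_stage_sum(pairs, num_salts=4, seed=0):
    """Sum values per key in two stages, salting keys to spread a hot key.

    Hypothetical helper for illustration only; in Spark each stage would
    roughly correspond to one reduceByKey over the (salted) keys.
    """
    rng = random.Random(seed)

    # Stage 1: attach a random salt so one hot key becomes up to
    # num_salts sub-keys, then aggregate per (salt, key).
    partial = defaultdict(int)
    for key, value in pairs:
        salt = rng.randrange(num_salts)
        partial[(salt, key)] += value

    # Stage 2: drop the salt and aggregate the much smaller set of
    # partial results back onto the original key.
    final = defaultdict(int)
    for (_salt, key), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)


# A skewed input: one key dominates the data set.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals = two_stage_sum(records)
```

The salting only pays off for aggregations that can be combined from partial results, such as sums and counts. The final totals are the same as a single‑stage aggregation would give.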

Data Skew · Performance Tuning · Shuffle Optimization
0 likes · 35 min read