Comprehensive Guide to Spark Performance Optimization, Data Skew Mitigation, and Troubleshooting
This article presents a detailed collection of Spark performance‑tuning techniques—including submit‑script parameters, RDD and operator optimizations, parallelism and memory settings, broadcast variables, Kryo serialization, locality wait adjustments—as well as systematic methods for detecting and resolving data skew and common runtime issues such as shuffle failures, serialization errors, and JVM memory problems.
The document begins with a production‑grade /usr/local/spark/bin/spark-submit command template, illustrating typical flags such as --num-executors, --driver-memory, and --executor-cores, as well as YARN‑specific options like spark.yarn.executor.memoryOverhead and spark.core.connection.ack.wait.timeout.
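A minimal template along the lines described might look as follows; the application jar path, resource sizes, and timeout values are illustrative placeholders, not values from the original article:

```shell
/usr/local/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 50 \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 4 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  /path/to/your-app.jar
```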
RDD Optimizations include reusing RDDs to avoid duplicate computation, persisting frequently accessed RDDs, applying filter operations as early as possible, and preferring mapPartitions over map when per‑partition setup costs (such as opening a single database connection per partition) can be amortized across all records. The guide also recommends foreachPartition for bulk database writes and using coalesce/repartition to balance partition sizes.
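The per‑partition connection idea behind foreachPartition can be sketched without Spark; in the sketch below, FakeConnection and write_partition are hypothetical names standing in for a real database client and for the function you would pass to rdd.foreachPartition:

```python
# Sketch of the foreachPartition pattern: one connection per partition
# instead of one per record. FakeConnection stands in for a real
# database client.

class FakeConnection:
    """Hypothetical stand-in for a database connection."""
    def __init__(self):
        self.writes = []
        self.opened = True

    def write(self, record):
        self.writes.append(record)

    def close(self):
        self.opened = False

def write_partition(records, connection_factory):
    """Open a single connection, write every record in the partition,
    then close it - the cost is paid once per partition."""
    conn = connection_factory()
    try:
        for record in records:
            conn.write(record)
    finally:
        conn.close()
    return conn  # returned only so the sketch can be inspected

conn = write_partition(iter([1, 2, 3]), FakeConnection)
print(len(conn.writes))  # 3 records written over a single connection
```

With map/foreach the connection setup would run once per record; here it runs once per partition, which is what makes bulk writes viable.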
Parallelism Tuning follows Spark’s recommendation of setting the number of tasks to 2–3 times the total CPU cores, e.g., val conf = new SparkConf().set("spark.default.parallelism", "500"). Adjusting spark.locality.wait (e.g., val conf = new SparkConf().set("spark.locality.wait", "6")) can improve task placement.
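The 2–3× rule of thumb amounts to a simple multiplication; the helper below is our own illustration, not part of the Spark API:

```python
def recommended_parallelism(total_cores, factor=3):
    """Spark's tuning guide suggests roughly 2-3 tasks per CPU core;
    this helper just applies that multiplier to the cluster's core count."""
    return total_cores * factor

# e.g. 50 executors with 4 cores each:
print(recommended_parallelism(50 * 4))  # 600 - set spark.default.parallelism near this
```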
Broadcast Variables are explained as a way to share large read‑only data across executors: the data is shipped once per executor rather than once per task, reducing network traffic and memory overhead compared to sending the variable with every task.
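The benefit is easiest to see in a map‑side join, where a small read‑only lookup table is shared rather than shuffled. A plain‑Python sketch of the idea (names and data are illustrative; in Spark the table would be wrapped with sc.broadcast and read via .value):

```python
# Sketch of a broadcast-style map-side join: the small lookup table is
# shared read-only, and each "task" joins against it locally instead of
# shuffling both sides.

lookup = {"u1": "Berlin", "u2": "Tokyo"}   # small table, shared once

def map_side_join(records, small_table):
    """Join each (user_id, payload) record against the shared table."""
    return [(uid, payload, small_table.get(uid, "unknown"))
            for uid, payload in records]

rows = [("u1", 10), ("u2", 20), ("u3", 30)]
print(map_side_join(rows, lookup))
```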
Kryo Serialization is recommended over Java serialization for up to a 10× speed boost. A custom registrator is shown as a Java class, public class MyKryoRegistrator implements KryoRegistrator { @Override public void registerClasses(Kryo kryo) { kryo.register(StartupReportLogs.class); } }, which is then wired in via val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") followed by conf.set("spark.kryo.registrator", "atguigu.com.MyKryoRegistrator").
JVM and Memory Settings cover reducing the cache memory fraction (spark.storage.memoryFraction), enabling unified memory management, and increasing executor off‑heap memory via --conf spark.yarn.executor.memoryOverhead=2048. Adjusting connection wait times and shuffle buffer sizes (spark.shuffle.file.buffer, spark.reducer.maxSizeInFlight, spark.shuffle.io.maxRetries, spark.shuffle.io.retryWait) is also discussed.
Data Skew Detection and Solutions describe identifying skewed tasks through the Spark UI and logs, then applying techniques such as: increasing shuffle buffer and retry parameters; adjusting the map‑side buffer size (spark.shuffle.file.buffer); using reduceByKey for map‑side pre‑aggregation; changing key granularity or filtering problematic keys; applying random prefixes to keys for two‑stage ("double") aggregation; converting reduce‑side joins to map‑side joins with broadcast variables; sampling skewed keys and handling them separately; and expanding keys N‑fold when appropriate.
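The random‑prefix ("salting") technique for double aggregation can be sketched in plain Python; the function name and data are ours, and in Spark each stage would be a reduceByKey over the salted and then the de‑salted keys:

```python
import random
from collections import defaultdict

def salted_two_stage_sum(pairs, n_salts=4, seed=0):
    """Two-stage aggregation for skewed keys:
    1) prefix each key with a random salt and pre-aggregate, so one hot
       key is spread across n_salts partial aggregates;
    2) strip the salt and aggregate again for the final per-key result."""
    rng = random.Random(seed)
    stage1 = defaultdict(int)
    for key, value in pairs:
        salted = f"{rng.randrange(n_salts)}_{key}"
        stage1[salted] += value          # first (spread-out) reduce
    stage2 = defaultdict(int)
    for salted, value in stage1.items():
        original = salted.split("_", 1)[1]
        stage2[original] += value        # second reduce on real keys
    return dict(stage2)

data = [("hot", 1)] * 1000 + [("cold", 1)] * 3
print(salted_two_stage_sum(data))  # totals are preserved: hot=1000, cold=3
```

The point of the salt is only load distribution; because addition is associative, the two‑stage sum equals the direct sum.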
Troubleshooting includes controlling the reduce‑side buffer size to avoid OOM, handling serialization errors by ensuring all custom classes are serializable, avoiding NULL returns in operator functions, and addressing YARN‑client vs. YARN‑cluster mode network traffic spikes. JVM PermSize adjustments (--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M", applicable only on Java 7 and earlier, since Java 8 replaced PermGen with Metaspace) and splitting complex SQL statements to prevent stack overflow are also recommended.
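The "avoid NULL returns" advice is commonly handled with a flatMap‑style pattern: return an empty collection for records to drop instead of None. A plain‑Python sketch (function and data names are ours):

```python
# Sketch of avoiding null returns from an operator function: instead of
# returning None for records we want to drop (which can break downstream
# operators), return an empty list and flatten - the flatMap pattern.

def parse_or_skip(line):
    """Return [parsed] for good records and [] for bad ones, never None."""
    try:
        return [int(line)]
    except ValueError:
        return []

lines = ["10", "oops", "32"]
flattened = [x for line in lines for x in parse_or_skip(line)]
print(flattened)  # [10, 32]
```

In Spark the same effect is achieved with rdd.flatMap(parse_or_skip), which silently drops the empty results without ever materializing nulls.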
Finally, the article explains the use of RDD checkpointing as a reliable fallback when cached data is lost, noting the trade‑off of additional HDFS writes.