Comprehensive Guide to Spark Performance Optimization, Data Skew Mitigation, and Troubleshooting
This article presents a detailed collection of Spark performance‑tuning techniques—including submit‑script parameters, RDD and operator optimizations, parallelism and memory settings, broadcast variables, Kryo serialization, locality wait adjustments—as well as systematic methods for detecting and resolving data skew and common runtime issues such as shuffle failures, serialization errors, and JVM memory problems.
The document begins with a production‑grade /usr/local/spark/bin/spark-submit command template, illustrating typical flags such as --num-executors, --driver-memory, and --executor-cores, as well as YARN‑specific options like spark.yarn.executor.memoryOverhead and spark.core.connection.ack.wait.timeout.
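A minimal template along the lines described might look as follows; the application jar path, resource sizes, and timeout values are illustrative placeholders, not values from the original article:

```shell
/usr/local/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 50 \
  --driver-memory 2g \
  --executor-memory 4g \
  --executor-cores 4 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  /path/to/your-app.jar
```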
RDD Optimizations include reusing RDDs to avoid duplicate computation, persisting frequently accessed RDDs, applying filter operations as early as possible, and preferring mapPartitions over map when per‑partition setup costs (such as opening a single database connection per partition) can be amortized across all records. The guide also recommends foreachPartition for bulk database writes and using coalesce/repartition to balance partition sizes.
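The per‑partition connection idea behind foreachPartition can be sketched without Spark; in the sketch below, FakeConnection and write_partition are hypothetical names standing in for a real database client and for the function you would pass to rdd.foreachPartition:

```python
# Sketch of the foreachPartition pattern: one connection per partition
# instead of one per record. FakeConnection stands in for a real
# database client.

class FakeConnection:
    """Hypothetical stand-in for a database connection."""
    def __init__(self):
        self.writes = []
        self.opened = True

    def write(self, record):
        self.writes.append(record)

    def close(self):
        self.opened = False

def write_partition(records, connection_factory):
    """Open a single connection, write every record in the partition,
    then close it - the cost is paid once per partition."""
    conn = connection_factory()
    try:
        for record in records:
            conn.write(record)
    finally:
        conn.close()
    return conn  # returned only so the sketch can be inspected

conn = write_partition(iter([1, 2, 3]), FakeConnection)
print(len(conn.writes))  # 3 records written over a single connection
```

With map/foreach the connection setup would run once per record; here it runs once per partition, which is what makes bulk writes viable.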
Parallelism Tuning follows Spark’s recommendation of setting the number of tasks to 2–3 times the total CPU cores, e.g., val conf = new SparkConf().set("spark.default.parallelism", "500"). Adjusting spark.locality.wait (e.g., val conf = new SparkConf().set("spark.locality.wait", "6")) can improve task placement.
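The 2–3× rule of thumb amounts to a simple multiplication; the helper below is our own illustration, not part of the Spark API:

```python
def recommended_parallelism(total_cores, factor=3):
    """Spark's tuning guide suggests roughly 2-3 tasks per CPU core;
    this helper just applies that multiplier to the cluster's core count."""
    return total_cores * factor

# e.g. 50 executors with 4 cores each:
print(recommended_parallelism(50 * 4))  # 600 - set spark.default.parallelism near this
```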
Broadcast Variables are explained as a way to share large read‑only data across executors: the data is shipped once per executor rather than once per task, reducing network traffic and memory overhead compared to sending the variable with every task.
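The benefit is easiest to see in a map‑side join, where a small read‑only lookup table is shared rather than shuffled. A plain‑Python sketch of the idea (names and data are illustrative; in Spark the table would be wrapped with sc.broadcast and read via .value):

```python
# Sketch of a broadcast-style map-side join: the small lookup table is
# shared read-only, and each "task" joins against it locally instead of
# shuffling both sides.

lookup = {"u1": "Berlin", "u2": "Tokyo"}   # small table, shared once

def map_side_join(records, small_table):
    """Join each (user_id, payload) record against the shared table."""
    return [(uid, payload, small_table.get(uid, "unknown"))
            for uid, payload in records]

rows = [("u1", 10), ("u2", 20), ("u3", 30)]
print(map_side_join(rows, lookup))
```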
Kryo Serialization is recommended over Java serialization for up to a 10× speed boost. A custom registrator is shown as a Java class, public class MyKryoRegistrator implements KryoRegistrator { @Override public void registerClasses(Kryo kryo) { kryo.register(StartupReportLogs.class); } }, which is then wired in via val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") followed by conf.set("spark.kryo.registrator", "atguigu.com.MyKryoRegistrator").
JVM and Memory Settings cover reducing the cache memory fraction (spark.storage.memoryFraction), enabling unified memory management, and increasing executor off‑heap memory via --conf spark.yarn.executor.memoryOverhead=2048. Adjusting connection wait times and shuffle buffer sizes (spark.shuffle.file.buffer, spark.reducer.maxSizeInFlight, spark.shuffle.io.maxRetries, spark.shuffle.io.retryWait) is also discussed.
Data Skew Detection and Solutions describe identifying skewed tasks through the Spark UI and logs, then applying techniques such as: increasing shuffle buffer and retry parameters; adjusting the map‑side buffer size (spark.shuffle.file.buffer); using reduceByKey for map‑side pre‑aggregation; changing key granularity or filtering problematic keys; applying random prefixes to keys for two‑stage ("double") aggregation; converting reduce‑side joins to map‑side joins with broadcast variables; sampling skewed keys and handling them separately; and expanding keys N‑fold when appropriate.
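The random‑prefix ("salting") technique for double aggregation can be sketched in plain Python; the function name and data are ours, and in Spark each stage would be a reduceByKey over the salted and then the de‑salted keys:

```python
import random
from collections import defaultdict

def salted_two_stage_sum(pairs, n_salts=4, seed=0):
    """Two-stage aggregation for skewed keys:
    1) prefix each key with a random salt and pre-aggregate, so one hot
       key is spread across n_salts partial aggregates;
    2) strip the salt and aggregate again for the final per-key result."""
    rng = random.Random(seed)
    stage1 = defaultdict(int)
    for key, value in pairs:
        salted = f"{rng.randrange(n_salts)}_{key}"
        stage1[salted] += value          # first (spread-out) reduce
    stage2 = defaultdict(int)
    for salted, value in stage1.items():
        original = salted.split("_", 1)[1]
        stage2[original] += value        # second reduce on real keys
    return dict(stage2)

data = [("hot", 1)] * 1000 + [("cold", 1)] * 3
print(salted_two_stage_sum(data))  # totals are preserved: hot=1000, cold=3
```

The point of the salt is only load distribution; because addition is associative, the two‑stage sum equals the direct sum.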
Troubleshooting includes controlling the reduce‑side buffer size to avoid OOM, handling serialization errors by ensuring all custom classes are serializable, avoiding NULL returns in operator functions, and addressing YARN‑client vs. YARN‑cluster mode network traffic spikes. JVM PermSize adjustments (--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M", applicable only on Java 7 and earlier, since Java 8 replaced PermGen with Metaspace) and splitting complex SQL statements to prevent stack overflow are also recommended.
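The "avoid NULL returns" advice is commonly handled with a flatMap‑style pattern: return an empty collection for records to drop instead of None. A plain‑Python sketch (function and data names are ours):

```python
# Sketch of avoiding null returns from an operator function: instead of
# returning None for records we want to drop (which can break downstream
# operators), return an empty list and flatten - the flatMap pattern.

def parse_or_skip(line):
    """Return [parsed] for good records and [] for bad ones, never None."""
    try:
        return [int(line)]
    except ValueError:
        return []

lines = ["10", "oops", "32"]
flattened = [x for line in lines for x in parse_or_skip(line)]
print(flattened)  # [10, 32]
```

In Spark the same effect is achieved with rdd.flatMap(parse_or_skip), which silently drops the empty results without ever materializing nulls.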
Finally, the article explains the use of RDD checkpointing as a reliable fallback when cached data is lost, noting the trade‑off of additional HDFS writes.