Understanding Data Skew and Its Mitigation in Hive and Spark
This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.
Data skew occurs when a disproportionate number of records sharing the same key are routed to a single partition, so one node is overloaded while the others sit idle, defeating the purpose of parallel computation.
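A minimal sketch in plain Python (not Spark or Hive) of why this happens: shuffle partitioning is typically hash(key) modulo the partition count, so every record carrying the same key is routed to the same partition.

```python
# Stand-in for a framework's hash partitioner: all records with an
# identical key map to one partition, however many there are.
from collections import Counter

def partition_for(key, num_partitions):
    return hash(key) % num_partitions

records = ["hot_key"] * 1000 + ["k1", "k2", "k3", "k4"] * 10
sizes = Counter(partition_for(k, 8) for k in records)

# The partition holding "hot_key" processes ~1000 records while the
# remaining partitions split only 40 between them.
print(sorted(sizes.values(), reverse=True))
```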
Typical symptoms include most tasks finishing quickly while a few run extremely slowly, sudden OOM exceptions in Spark jobs, and job progress appearing stuck at 99% (or even 100%) because task workloads are uneven.
General mitigations include increasing JVM memory, raising the number of reducers, implementing a custom partitioner, salting keys with a random prefix, using a combiner for map-side (local) aggregation, redesigning the key, and avoiding rand() in join conditions.
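The random-prefix (salting) technique can be sketched in plain Python, with hypothetical helper names: stage one aggregates on a (salt, key) pair so a hot key is spread across many reducers, and stage two strips the salt and merges the partial results.

```python
# Two-stage aggregation with a random salt (plain-Python sketch,
# illustrative names only).
import random
from collections import defaultdict

def salted_sum(records, num_salts=4):
    # Stage 1: partial aggregation on the salted key, splitting the
    # work for any hot key across up to num_salts groups.
    partial = defaultdict(int)
    for key, value in records:
        salt = random.randrange(num_salts)
        partial[(salt, key)] += value
    # Stage 2: drop the salt and combine the partial results.
    final = defaultdict(int)
    for (_, key), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)

data = [("hot", 1)] * 1000 + [("cold", 1)] * 5
print(salted_sum(data))  # same totals as a plain group-by-and-sum
```

The totals are identical to an unsalted aggregation; only the intermediate parallelism changes.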
Hive-specific remedies: enable map-side aggregation (set hive.map.aggr=true), turn on automatic skew handling for group-by (set hive.groupby.skewindata=true), convert joins against small tables into map joins (set hive.auto.convert.join=true; set hive.mapjoin.smalltable.filesize=25000000), rewrite joins to filter out null keys first, avoid select *, prefer sort by over order by when a total order is not required, and replace count(distinct) with a group-by plus sum(1) pattern. Example map-join hint:

select /*+ mapjoin(b) */ a.field1 as field1, b.field2 as field2, b.field3 as field3
from a left join b on a.field1 = b.field1;
Spark-specific remedies: enable adaptive query execution (spark.sql.adaptive.enabled=true), turn on skewed-join handling (spark.sql.adaptive.skewJoin.enabled=true in open-source Spark 3.x; some distributions use spark.sql.adaptive.skewedJoin.enabled), raise the broadcast-join threshold (spark.sql.autoBroadcastJoinThreshold=524288000), enable verbose shuffle statistics (spark.shuffle.statistics.verbose=true), and tune the skew-detection parameters spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB and spark.sql.adaptive.skewJoin.skewedPartitionFactor=5. For extreme cases, enable inflation detection (spark.sql.adaptive.skewJoin.inflation.enabled=true, spark.sql.adaptive.skewJoin.inflation.factor=50) and skew detection (spark.sql.adaptive.shuffle.detectSkewness=true). DataFrame example of two-stage aggregation with a salt column:

dataframe
  .groupBy(col("key"), pmod(hash(col("some_col")), 100))
  .agg(max("value").as("partial_max"))
  .groupBy(col("key"))
  .agg(max("partial_max").as("max"))
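The two skew-detection parameters express a simple rule, sketched here in plain Python (this mirrors how Spark's adaptive execution decides which partitions to split, under the assumption that a partition must exceed both limits to count as skewed):

```python
# Plain-Python sketch of the skew-detection rule: a shuffle partition is
# treated as skewed when its size exceeds BOTH the absolute threshold
# (skewedPartitionThresholdInBytes) and skewedPartitionFactor times the
# median partition size.
from statistics import median

THRESHOLD_BYTES = 256 * 1024 * 1024  # skewedPartitionThresholdInBytes=256MB
FACTOR = 5                           # skewedPartitionFactor=5

def skewed_partitions(sizes_bytes):
    med = median(sizes_bytes)
    return [i for i, size in enumerate(sizes_bytes)
            if size > THRESHOLD_BYTES and size > FACTOR * med]

# Nine ordinary ~64 MB partitions and one 2 GB straggler.
sizes = [64 * 1024 * 1024] * 9 + [2 * 1024 * 1024 * 1024]
print(skewed_partitions(sizes))  # only the last partition qualifies
```

The dual condition prevents splitting partitions that are relatively large but still cheap in absolute terms.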
Skew detection methods: examine the Spark UI or YARN logs to locate the problematic stage, count key frequencies via SQL (e.g., select key, count(*) as cnt from table group by key order by cnt desc limit N), or use RDD sampling techniques that map each key to a count of 1, aggregate, swap key and count, and sort to surface the hot keys.
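The sampling pipeline above can be sketched in plain Python standing in for the RDD operations (sample, map to (key, 1), reduce by key, swap, sort):

```python
# Plain-Python stand-in for the RDD hot-key sampling pipeline.
import random
from collections import Counter

def top_hot_keys(keys, sample_fraction=0.1, top_n=3):
    # Sample a fraction of the keys instead of scanning everything.
    sample = [k for k in keys if random.random() < sample_fraction]
    # Map each key to (key, 1) and aggregate, as reduceByKey would.
    counts = Counter(sample)
    # Swap to (count, key) and sort descending to rank the hot keys.
    swapped = [(n, k) for k, n in counts.items()]
    return [k for n, k in sorted(swapped, reverse=True)[:top_n]]

keys = ["hot"] * 10000 + ["warm"] * 1000 + [f"cold{i}" for i in range(100)]
print(top_hot_keys(keys))  # "hot" should rank first with high probability
```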
Additional recommendations include monitoring data distribution across dates and key dimensions, adding data‑quality checks at each processing layer, performing health‑score assessments for resources and parameters, and handling hot keys separately (e.g., extracting them to a dimension table and processing them with random distribution).
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.