
Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical scenarios that cause it, and provides practical strategies and platform‑specific optimizations to detect, mitigate, and prevent skew in big‑data processing pipelines.


Data skew is a pervasive obstacle in the big‑data domain; when processing billions or even trillions of records, uneven data distribution can cause a massive performance bottleneck.

When data is insufficiently dispersed, a large volume of records can be concentrated on one or a few machines; those machines then run far slower than the average, potentially stalling the entire job for weeks.

Disclaimer: The topic is broad and technically demanding; the author shares personal understanding and invites readers to discuss any inaccuracies or omissions.

Article Structure

Explain what data skew is.

Describe scenarios that lead to data skew.

Analyze the causes of data skew in Hadoop and Spark.

Provide solutions and optimizations for data skew.

0x01 What Is Data Skew

Simply put, data skew occurs when the data distribution is poor, causing a massive amount of data to be processed on a single or a few nodes, which runs far slower than the average and drags the whole job down.

Key Term: Data Skew

Most data engineers encounter data skew in various stages of data development, such as:

Hive reduce phase stuck at 99.99%.

Spark Streaming jobs that repeatedly encounter executor OOM while other executors remain idle.

These issues often make hours‑long jobs fail to finish, leading to frustration.

0x02 What Data Skew Looks Like

The author illustrates several typical symptoms, focusing on Hadoop and Spark as the most common platforms.

1. Data Skew in Hadoop

In Hadoop, data skew usually manifests as the reduce phase stuck at 99.99%.

One or more reducers are blocked.

Containers report OOM.

Read/write volume for the affected reducers is far larger than normal.

Such skew often results in killed tasks and other anomalous behavior.

2. Data Skew in Spark

In Spark (including Spark Streaming and Spark SQL), common symptoms include:

Executor-lost errors, OOM, and shuffle fetch failures.

Driver OOM.

One executor runs for an excessively long time, blocking the whole job.

Sudden task failures.

Streaming jobs are especially vulnerable when joins or group‑by operations are present, because memory is limited and skew can quickly cause OOM.

0x03 The Principle Behind Data Skew

1. Causes of Data Skew

Operations such as count(distinct), group by, and join trigger a shuffle; if the key distribution is uneven, a large amount of data is sent to a single node, creating a hotspot.

2. The Evil Shuffle

Both Hadoop and Spark rely on shuffle to exchange data between map and reduce stages. When the key distribution is skewed, the shuffle concentrates massive data on one node, leading to the problems described above.

3. Understanding Skew from a Data Perspective

Example tables: user(userid, register_ip) and ip(ip, register_user_cnt). If users with a missing IP are defaulted to 0, a join on the IP field funnels every ip = 0 row to a single reducer, causing a stall.
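The effect can be sketched with a toy partitioner in plain Python (no Hadoop required; the table contents, skew ratio, and reducer count are invented for illustration). Hash-partitioning the join key sends every defaulted ip = 0 row to the same reducer:

```python
from collections import Counter
import random

NUM_REDUCERS = 10

# Toy user table: 90% of users have no registered IP, defaulted to 0.
random.seed(42)
users = [(uid, 0 if random.random() < 0.9 else random.randint(1, 10_000))
         for uid in range(100_000)]

# A shuffle assigns each row to a reducer by hashing the join key (the IP).
load = Counter(hash(ip) % NUM_REDUCERS for _uid, ip in users)

hottest_reducer, hottest_rows = load.most_common(1)[0]
print(f"rows on hottest reducer: {hottest_rows} of {len(users)}")
# The reducer that receives key 0 holds roughly 90% of all rows: a hotspot.
```

The other nine reducers finish almost immediately while one grinds through nearly the whole table, which is exactly the "stuck at 99.99%" symptom described above.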

4. Understanding Skew from a Business Perspective

Business events can create uneven data distribution—for instance, a sudden promotion in Beijing and Shanghai may increase orders by 10,000% in those cities while other regions stay flat, causing a group‑by on city to skew heavily.

0x04 How to Solve Data Skew

Solving data skew involves both business‑level adjustments and technical optimizations.

General Strategies

Adjust business logic, e.g., compute hot keys separately and then merge results.

Programmatic fixes, such as rewriting count(distinct) as a two‑step group by followed by count.

Parameter tuning: leverage built‑in Hadoop and Spark settings that mitigate skew.
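The count(distinct) rewrite can be illustrated in plain Python (a sketch of the idea, not Hive internals; the function names are invented). A single count(distinct) funnels every value to one reducer, whereas group-by-then-count spreads the deduplication across reducers and leaves only a trivial sum at the end:

```python
# Stage 1: "group by" — deduplication is spread across reducers,
# because rows are partitioned by the grouped key, not funneled to one node.
def stage_group_by(rows, num_reducers):
    partitions = [set() for _ in range(num_reducers)]
    for uid in rows:
        partitions[hash(uid) % num_reducers].add(uid)
    return partitions

# Stage 2: "count" — each reducer reports only its local distinct count;
# the final merge is a cheap sum of small numbers.
def stage_count(partitions):
    return sum(len(p) for p in partitions)

rows = [uid % 1_000 for uid in range(50_000)]  # many duplicate user ids
distinct = stage_count(stage_group_by(rows, 8))
```

Because each key lands in exactly one partition, the per-partition counts sum to the true distinct count, but no single node ever has to hold all the values.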

Data‑Centric Solutions

Lossy approach: filter out abnormal data (e.g., rows with IP = 0).

Lossless approach: compute skewed keys separately or add a hash layer to spread the data before aggregation.

Pre‑process data to clean or rebalance distributions.
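The lossless "hash layer" idea is two-phase aggregation with salted keys. A minimal sketch in plain Python (bucket count and data are illustrative assumptions): a random salt is appended to each key so a hot key spreads across many reducers in phase one, then the salt is stripped and partial results are merged in phase two:

```python
import random
from collections import Counter

SALT_BUCKETS = 8

def salted_two_phase_sum(rows):
    """Two-phase aggregation: salt the key, pre-aggregate, then merge."""
    random.seed(0)
    # Phase 1: append a random salt so one hot key is split over
    # up to SALT_BUCKETS reducers instead of landing on a single one.
    partial = Counter()
    for key, value in rows:
        salted = (key, random.randrange(SALT_BUCKETS))
        partial[salted] += value
    # Phase 2: strip the salt and merge the (much smaller) partial sums.
    final = Counter()
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)

rows = [("beijing", 1)] * 10_000 + [("chengdu", 1)] * 10
result = salted_two_phase_sum(rows)
```

The trick works for any associative aggregation (sum, count, max); it trades one extra aggregation pass for a balanced phase-one workload.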

Hadoop‑Specific Optimizations

Use map‑join (broadcast join) when one side is small.

Rewrite count(distinct) as group by followed by count.

Enable hive.groupby.skewindata=true.

Leverage left‑semi join where appropriate.

Compress map‑output and intermediate results to reduce I/O and network pressure.
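Map-join sidesteps skew entirely when one side fits in memory: the small table is broadcast to every mapper and the join happens locally, so the big table is never shuffled by the (possibly skewed) join key. A minimal sketch of the idea in plain Python, reusing the user/ip tables from earlier (table contents invented; in Hive this conversion is controlled by hive.auto.convert.join):

```python
def map_side_join(big_rows, small_table):
    """Broadcast join: the small table becomes an in-memory lookup dict
    on every mapper, so no reduce phase — and no reduce-side skew."""
    lookup = dict(small_table)  # the "broadcast" copy, built once per mapper
    for userid, register_ip in big_rows:
        # Each big-table row is joined locally; a hot key like ip = 10
        # costs the same per row as any other key.
        yield userid, register_ip, lookup.get(register_ip, 0)

users = [(1, 10), (2, 10), (3, 10), (4, 99)]  # big table, skewed on ip = 10
ip_counts = [(10, 3), (42, 1)]                # small dimension table
joined = list(map_side_join(users, ip_counts))
```

The obvious caveat: this only works while the small side genuinely fits in each mapper's memory; broadcasting a large table just moves the OOM somewhere else.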

Spark‑Specific Optimizations

Use map‑join (broadcast join) for small tables.

Enable RDD compression.

Allocate sufficient driver memory.

Apply Spark SQL optimizations similar to Hive (e.g., skew handling settings).
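Before applying any of these fixes, it helps to confirm which keys are actually hot. A common first step is to sample the data and count keys, sketched here in plain Python (in Spark one would sample an RDD or DataFrame instead; the function name, sample size, and data are illustrative):

```python
from collections import Counter
import random

def top_hot_keys(rows, sample_size=1_000, top_n=3, seed=7):
    """Estimate hot keys from a sample before choosing a skew mitigation."""
    random.seed(seed)
    sample = random.sample(rows, min(sample_size, len(rows)))
    return Counter(key for key, _value in sample).most_common(top_n)

# 90% of rows share one key — the sample should surface it immediately.
rows = [("beijing", 1)] * 9_000 + [(f"city{i}", 1) for i in range(1_000)]
hot = top_hot_keys(rows)
```

Once the hot keys are known, they can be filtered, computed separately, salted, or broadcast-joined as described above, rather than applying a blanket fix to the whole dataset.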

0xFF Summary

Data skew remains a significant challenge in large‑scale data processing; addressing it requires a combination of data design, business understanding, and platform‑specific tuning. The ideas presented here aim to give readers practical ways to identify and mitigate skew.

Further topics such as detailed Hive SQL tuning, data‑cleansing pitfalls, and challenges beyond the trillion‑record scale will be covered in future articles.

Tags: optimization, big data, data skew, distributed computing, Spark, Hadoop, shuffle
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
