Big Data 5 min read

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Data Thinking Notes
Data Thinking Notes
Data Thinking Notes
Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

1. Phenomenon

Using Sqoop to sync a table from RDS to Hive normally takes less than 2.5 hours, but occasionally the sync exceeds 8 hours, sometimes over 12 hours, affecting downstream batch tasks.

2. Investigation Process

2.1 Check Sqoop sync task logs

MapReduce task 0%‑88% took 1 hour 9 minutes, 88%‑100% took 9 hours 20 minutes; the last map stage was abnormal, prompting a check of the RDS server.

2.2 Check RDS server

Slow query analysis:

Eight slow queries were generated; seven lasted about 40 minutes, one exceeded 10 hours.

Data skew: among eight parallel queries, only three returned rows (30 k, 38 rows, 9.9 billion rows); one query returned 99 % of the data, indicating severe skew.

Root cause analysis:

The split column

incr_id

is neither a primary key nor indexed.

Sqoop’s splitter converts Varchar columns to

BigDecimal

before splitting, which can produce garbled values and cause skew.

Sqoop documentation shows that for String/Varchar columns the splitter first converts the value to

BigDecimal

, splits, then converts back to String, which can introduce garbled data.

Source code inspection confirms

IntegerSplitter

for integer columns and

TextSplitter

(via

BigDecimal

) for Varchar columns, reproducing the garbled issue.

Splitting by

incr_id

leads to data skew.

3. Solution

3.1 Add a BigInt auto‑increment column in the RDS table

Adding an auto‑increment column as primary key or index and using it as the split column would solve the problem, but it requires significant changes, so it was not adopted.

3.2 Find a more evenly distributed column in the existing table

Using the

value

column (prefix 00‑FF) as the split column, setting parallelism to 6 or 7 yields more balanced splits. Tests show that setting parallelism to 7 produces evenly distributed ranges (0‑3, 3‑6, 6‑9, 9‑&, …).

Thus the Sqoop command was adjusted to

--split-by value -m 7

.

4. Result

After the change, the sync time from August 8‑11 dropped to 1‑2 hours, and the data split was uniform.

5. Optimization Recommendations

For large tables, add a BigInt auto‑increment column as primary key or index to facilitate future data synchronization.

big dataPerformance TuningHiveData SkewRDSSqoop
Data Thinking Notes
Written by

Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.