Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It
A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.
1. Phenomenon
Using Sqoop to sync a table from RDS to Hive normally takes less than 2.5 hours, but occasionally the sync exceeds 8 hours, sometimes over 12 hours, affecting downstream batch tasks.
2. Investigation Process
2.1 Check Sqoop sync task logs
MapReduce task 0%‑88% took 1 hour 9 minutes, 88%‑100% took 9 hours 20 minutes; the last map stage was abnormal, prompting a check of the RDS server.
2.2 Check RDS server
Slow query analysis:
Eight slow queries were generated; seven lasted about 40 minutes, one exceeded 10 hours.
Data skew: among eight parallel queries, only three returned rows (30 k, 38 rows, 9.9 billion rows); one query returned 99 % of the data, indicating severe skew.
Root cause analysis:
The split column
incr_idis neither a primary key nor indexed.
Sqoop’s splitter converts Varchar columns to
BigDecimalbefore splitting, which can produce garbled values and cause skew.
Sqoop documentation shows that for String/Varchar columns the splitter first converts the value to
BigDecimal, splits, then converts back to String, which can introduce garbled data.
Source code inspection confirms
IntegerSplitterfor integer columns and
TextSplitter(via
BigDecimal) for Varchar columns, reproducing the garbled issue.
Splitting by
incr_idleads to data skew.
3. Solution
3.1 Add a BigInt auto‑increment column in the RDS table
Adding an auto‑increment column as primary key or index and using it as the split column would solve the problem, but it requires significant changes, so it was not adopted.
3.2 Find a more evenly distributed column in the existing table
Using the
valuecolumn (prefix 00‑FF) as the split column, setting parallelism to 6 or 7 yields more balanced splits. Tests show that setting parallelism to 7 produces evenly distributed ranges (0‑3, 3‑6, 6‑9, 9‑&, …).
Thus the Sqoop command was adjusted to
--split-by value -m 7.
4. Result
After the change, the sync time from August 8‑11 dropped to 1‑2 hours, and the data split was uniform.
5. Optimization Recommendations
For large tables, add a BigInt auto‑increment column as primary key or index to facilitate future data synchronization.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.