Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation
iQIYI migrated hundreds of petabytes of Hive tables to Apache Iceberg using dual‑write, in‑place, and CTAS strategies, combined with partition pruning, Bloom filters, and Trino/Alluxio optimizations, achieving up to 40% lower query latency, simplified pipelines, and faster, cost‑effective data lake operations.
iQIYI originally built a traditional offline data warehouse on Hive to support operations, user growth, video recommendation, membership, advertising and other business needs. With increasing real‑time requirements, the company introduced Apache Iceberg as a data‑lake solution to dramatically improve query performance and overall data circulation efficiency. Migrating hundreds of petabytes of existing Hive tables to Iceberg posed a major challenge.
This article describes iQIYI’s end‑to‑end technical solution for a smooth migration from Hive to Iceberg, covering architecture, migration methods, performance optimizations, and business impact.
Hive vs. Iceberg
Hive is a Hadoop‑based data warehouse that provides a SQL‑like language for complex data processing. Iceberg is an open‑source table format that provides scalable, efficient table management on data‑lake storage, with transactional guarantees and support for row‑level updates and deletes.
Key differences (see Table 1‑1 in the original article) include timeliness, query performance, and support for DDL operations.
Migration Strategies
Three main approaches are discussed:
Dual‑write (business double‑write): run pipelines that write to Hive and Iceberg in parallel, then switch reads once the row counts match.
In‑place migration with client pause: stop writes temporarily and use Iceberg's built‑in migration procedures.
Self‑service migration tools provided on the data‑lake platform.
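For the dual‑write path, the switchover is gated on the two tables agreeing. A minimal per‑partition count check, sketched in Spark SQL (table and partition column names are illustrative, not from the original article):

```sql
-- Compare per-partition row counts between the Hive table and its Iceberg twin;
-- any row returned here indicates a mismatch that should block the switchover.
SELECT h.dt, h.cnt AS hive_cnt, i.cnt AS iceberg_cnt
FROM (SELECT dt, COUNT(*) AS cnt FROM db.sample         GROUP BY dt) h
FULL OUTER JOIN
     (SELECT dt, COUNT(*) AS cnt FROM db.sample_iceberg GROUP BY dt) i
  ON h.dt = i.dt
-- <=> is Spark's null-safe equality, so partitions missing on either side
-- also surface as mismatches.
WHERE NOT (h.cnt <=> i.cnt);
```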
Two concrete code‑based methods are highlighted:
CALL catalog_name.system.migrate('db.sample');
This Iceberg‑provided Spark procedure converts a Hive table to Iceberg in‑place, scanning the original files and building Iceberg metadata without modifying the data.
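Before committing to an in‑place `migrate`, Iceberg also ships a `snapshot` procedure that builds a temporary Iceberg table over the same data files without touching the Hive table, which is useful as a non‑destructive trial run (catalog and table names follow the article's example):

```sql
-- Creates db.sample_snap as an Iceberg table referencing the Hive table's
-- existing data files; the original db.sample remains a Hive table.
CALL catalog_name.system.snapshot('db.sample', 'db.sample_snap');
```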
Alternatively, a CTAS (Create‑Table‑As‑Select) approach can be used:
CREATE TABLE db.sample_iceberg (
  id BIGINT, ..., dt STRING
)
USING iceberg
PARTITIONED BY (dt)
LOCATION 'qbfs://....'
TBLPROPERTIES ('write.target-file-size-bytes' = '536870912', ...)  -- 512 MB
AS SELECT * FROM db.sample;
After the CTAS finishes, the tables are swapped:
ALTER TABLE db.sample RENAME TO db.sample_backup;
ALTER TABLE db.sample_iceberg RENAME TO db.sample;
The CTAS method rewrites historical data, allowing partition pruning, column ordering, and file‑size tuning, at the cost of higher I/O for large datasets.
Performance Acceleration Techniques
Iceberg provides three layers of data filtering:
Partition pruning: similar to Hive, the engine prunes partitions based on WHERE predicates.
File‑level filtering: Min/Max statistics and optional Bloom filters let the engine skip irrelevant files.
Row‑group filtering: columnar formats (Parquet, ORC) expose per‑row‑group statistics, enabling further pruning.
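Taken together, a selective query can skip most of the data at all three layers. An illustrative query (the column names and values are hypothetical, not from the original article):

```sql
-- dt = '2024-01-01' prunes partitions; Min/Max statistics and Bloom filters
-- (if enabled) then skip whole files and row groups that cannot contain
-- user_id = 42, so only a small fraction of the data is actually read.
SELECT * FROM db.sample WHERE dt = '2024-01-01' AND user_id = 42;
```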
Additional tricks include:
Ordering data on high‑cardinality columns to improve Min/Max selectivity.
Enabling Bloom filters on frequently queried columns (e.g., write.parquet.bloom-filter-enabled.column.test = true, where test is the column name).
Using Trino instead of Spark for faster MPP queries.
Caching with Alluxio to serve hot data from memory.
Choosing ORC over Parquet when Trino is the query engine.
Merging small files into larger ones to reduce I/O.
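Several of these knobs are plain table DDL and maintenance procedures. A hedged sketch, assuming the Iceberg Spark SQL extensions are enabled (catalog, table, and column names are illustrative):

```sql
-- Keep new writes sorted on a high-cardinality column so file-level
-- Min/Max ranges stay tight and selective.
ALTER TABLE db.sample WRITE ORDERED BY user_id;

-- Enable a Parquet Bloom filter on a frequently queried column.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);

-- Compact small files toward a ~512 MB target to reduce per-file overhead.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map('target-file-size-bytes', '536870912')
);
```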
Change Management in Iceberg
Iceberg supports row‑level DML:
DELETE FROM table_name WHERE channel_id = '2';
UPDATE table_name SET category = 'c2' WHERE id = '2';
MERGE INTO db.target t USING (SELECT …) s ON t.id = s.id WHEN MATCHED THEN UPDATE SET … WHEN NOT MATCHED THEN INSERT …;
Three change strategies are available:
Copy‑on‑Write (CoW): files containing updated rows are rewritten; good for read‑heavy workloads.
Merge‑on‑Read (MoR): new delta files record changes; reads merge the deltas at query time. MoR can use position deletes (file‑offset based) or equality deletes (primary‑key based).
Configuration properties such as write.delete.mode, write.merge.mode, and write.update.mode control the chosen strategy.
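These modes are set per table. For example, to switch a table to merge‑on‑read for all three DML types (a sketch; the table name is illustrative):

```sql
-- Each DML type can be configured independently; here all three use
-- merge-on-read, favoring fast writes over read-time merge cost.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
```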
Business Impact
By adopting Iceberg, iQIYI achieved:
Faster queries: up to 40% lower latency for sorted data, and a 2‑3× speedup of Trino over Spark.
Simplified pipelines: MERGE statements replace multi‑table joins for advertising billing, enabling hour‑level reporting.
Efficient data correction without full re‑runs (single‑statement UPDATE/DELETE).
Support for data‑change requirements such as GDPR deletions, incremental ETL, and CDC sync.
The migration framework allows a seamless switch from Hive to Iceberg with minimal downtime, preserving semantic compatibility while delivering higher performance and lower operational cost.
iQIYI Technical Product Team