Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation
iQIYI migrated hundreds of petabytes of Hive tables to Apache Iceberg using dual‑write, in‑place, and CTAS strategies, combined with partition pruning, Bloom filters, and Trino/Alluxio optimizations, achieving up to 40% lower query latency, simplified pipelines, and faster, cost‑effective data lake operations.
iQIYI originally built a traditional offline data warehouse on Hive to support operations, user growth, video recommendation, membership, advertising and other business needs. With increasing real‑time requirements, the company introduced Apache Iceberg as a data‑lake solution to dramatically improve query performance and overall data circulation efficiency. Migrating hundreds of petabytes of existing Hive tables to Iceberg posed a major challenge.
This article describes iQIYI’s end‑to‑end technical solution for a smooth migration from Hive to Iceberg, covering architecture, migration methods, performance optimizations, and business impact.
Hive vs. Iceberg
Hive is a Hadoop‑based data warehouse that provides a SQL‑like language for complex data processing. Iceberg is an open‑source table format that provides scalable, efficient table management on data‑lake storage, with transactional guarantees and support for row‑level updates and deletes.
Key differences (see Table 1‑1 in the original article) include timeliness, query performance, and support for DDL operations.
Migration Strategies
Three main approaches are discussed:
Dual‑write (business double‑write): run pipelines that write to Hive and Iceberg in parallel, then switch reads once the row counts match.
In‑place migration with client pause: stop writes temporarily and use Iceberg's built‑in migration procedures.
Self‑service migration tools provided on the data‑lake platform.
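For the dual‑write path, the switchover is gated on the two tables agreeing. A minimal per‑partition count check, sketched in Spark SQL (table and partition column names are illustrative, not from the original article):

```sql
-- Compare per-partition row counts between the Hive table and its Iceberg twin;
-- any row returned here indicates a mismatch that should block the switchover.
SELECT h.dt, h.cnt AS hive_cnt, i.cnt AS iceberg_cnt
FROM (SELECT dt, COUNT(*) AS cnt FROM db.sample         GROUP BY dt) h
FULL OUTER JOIN
     (SELECT dt, COUNT(*) AS cnt FROM db.sample_iceberg GROUP BY dt) i
  ON h.dt = i.dt
-- <=> is Spark's null-safe equality, so partitions missing on either side
-- also surface as mismatches.
WHERE NOT (h.cnt <=> i.cnt);
```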
Two concrete code‑based methods are highlighted:
CALL catalog_name.system.migrate('db.sample');
This Iceberg‑provided Spark procedure converts a Hive table to Iceberg in‑place, scanning the original files and building Iceberg metadata without modifying the data.
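Before committing to an in‑place `migrate`, Iceberg also ships a `snapshot` procedure that builds a temporary Iceberg table over the same data files without touching the Hive table, which is useful as a non‑destructive trial run (catalog and table names follow the article's example):

```sql
-- Creates db.sample_snap as an Iceberg table referencing the Hive table's
-- existing data files; the original db.sample remains a Hive table.
CALL catalog_name.system.snapshot('db.sample', 'db.sample_snap');
```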
Alternatively, a CTAS (Create‑Table‑As‑Select) approach can be used:
CREATE TABLE db.sample_iceberg (
  id BIGINT, ..., dt STRING
)
USING iceberg
PARTITIONED BY (dt)
LOCATION 'qbfs://....'
TBLPROPERTIES ('write.target-file-size-bytes' = '536870912', ...)  -- 512 MB
AS SELECT * FROM db.sample;
After the CTAS finishes, the tables are swapped:
ALTER TABLE db.sample RENAME TO db.sample_backup;
ALTER TABLE db.sample_iceberg RENAME TO db.sample;
The CTAS method rewrites historical data, allowing partition pruning, column ordering, and file‑size tuning, at the cost of higher I/O for large datasets.
Performance Acceleration Techniques
Iceberg provides three layers of data filtering:
Partition pruning: similar to Hive, the engine prunes partitions based on WHERE predicates.
File‑level filtering: Min/Max statistics and optional Bloom filters let the engine skip irrelevant files.
Row‑group filtering: columnar formats (Parquet, ORC) expose per‑row‑group statistics, enabling further pruning.
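Taken together, a selective query can skip most of the data at all three layers. An illustrative query (the column names and values are hypothetical, not from the original article):

```sql
-- dt = '2024-01-01' prunes partitions; Min/Max statistics and Bloom filters
-- (if enabled) then skip whole files and row groups that cannot contain
-- user_id = 42, so only a small fraction of the data is actually read.
SELECT * FROM db.sample WHERE dt = '2024-01-01' AND user_id = 42;
```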
Additional tricks include:
Ordering data on high‑cardinality columns to improve Min/Max selectivity.
Enabling Bloom filters on frequently queried columns (e.g., write.parquet.bloom-filter-enabled.column.test = true, where test is the column name).
Using Trino instead of Spark for faster MPP queries.
Caching with Alluxio to serve hot data from memory.
Choosing ORC over Parquet when Trino is the query engine.
Merging small files into larger ones to reduce I/O.
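Several of these knobs are plain table DDL and maintenance procedures. A hedged sketch, assuming the Iceberg Spark SQL extensions are enabled (catalog, table, and column names are illustrative):

```sql
-- Keep new writes sorted on a high-cardinality column so file-level
-- Min/Max ranges stay tight and selective.
ALTER TABLE db.sample WRITE ORDERED BY user_id;

-- Enable a Parquet Bloom filter on a frequently queried column.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);

-- Compact small files toward a ~512 MB target to reduce per-file overhead.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map('target-file-size-bytes', '536870912')
);
```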
Change Management in Iceberg
Iceberg supports row‑level DML:
DELETE FROM table_name WHERE channel_id = '2';
UPDATE table_name SET category = 'c2' WHERE id = '2';
MERGE INTO db.target t USING (SELECT …) s ON t.id = s.id WHEN MATCHED THEN UPDATE SET … WHEN NOT MATCHED THEN INSERT …;
Three change strategies are available:
Copy‑on‑Write (CoW): files containing updated rows are rewritten; good for read‑heavy workloads.
Merge‑on‑Read (MoR): new delta files record changes; reads merge the deltas at query time. MoR can use position deletes (file‑offset based) or equality deletes (primary‑key based).
Configuration properties such as write.delete.mode, write.merge.mode, and write.update.mode control the chosen strategy.
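These modes are set per table. For example, to switch a table to merge‑on‑read for all three DML types (a sketch; the table name is illustrative):

```sql
-- Each DML type can be configured independently; here all three use
-- merge-on-read, favoring fast writes over read-time merge cost.
ALTER TABLE db.sample SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
```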
Business Impact
By adopting Iceberg, iQIYI achieved:
Faster queries: up to 40% lower latency for sorted data, and a 2‑3× speedup of Trino over Spark.
Simplified pipelines: MERGE statements replace multi‑table joins for advertising billing, enabling hour‑level reporting.
Efficient data correction without full re‑runs (single‑statement UPDATE/DELETE).
Support for data‑change requirements such as GDPR deletions, incremental ETL, and CDC sync.
The migration framework allows a seamless switch from Hive to Iceberg with minimal downtime, preserving semantic compatibility while delivering higher performance and lower operational cost.
iQIYI Technical Product Team