Cost‑Effective Real‑Time Data Warehouse 2.0: Migrating from Kafka to Iceberg
iQIYI rebuilt its real‑time data warehouse, replacing a costly Kafka‑based Lambda stack with a unified stream‑batch Iceberg lake. The move cut storage expenses by 90%, halved compute costs, extended data retention, and delivers minute‑level freshness for 90% of use cases while preserving second‑level processing where it is genuinely needed.
In iQIYI's pan‑entertainment ecosystem, data is the core engine driving business growth. Real‑time data requirements have permeated the entire chain—from video playback and membership operations to ad recommendation—requiring sub‑minute feedback (e.g., user click events must be reflected in recommendation models within one minute). As the business scale grew, the second‑level real‑time warehouse built on Kafka faced significant challenges, especially high storage and compute costs.
Existing Architecture Cost Issues
The early real‑time warehouse used a Lambda architecture with Kafka + Flink for second‑level processing, but its cost was more than ten times that of an Iceberg‑based storage solution. Specific problems included:
Storage limitation: Kafka retains data only for a few hours with multiple replicas, making it unsuitable for long‑term storage and historical analysis.
Resource waste: Parallel real‑time and offline pipelines duplicate computation, especially during data cleaning and processing.
Over‑designed latency: Most scenarios only need minute‑level delay, making second‑level processing excessive.
Technical Transformation for Cost Reduction and Efficiency
To address these issues, a unified stream‑batch architecture was introduced, centering on an Iceberg data lake to achieve minute‑level real‑time warehousing. The transformation brought:
Storage cost reduced by 90%: Iceberg’s columnar storage on HDFS, with compression, costs roughly one‑tenth as much as Kafka.
Tiered latency: Seconds‑level scenarios still use Kafka; minute‑level scenarios migrate to Iceberg, allocating resources on demand.
Enhanced data governance: Large, monolithic streams are split into thematic micro‑streams, which can be recombined to satisfy diverse business needs.
This shift cut daily PB‑scale processing costs by 60% and upgraded data freshness from hour‑ or day‑level to minute‑level near‑real‑time, dramatically improving business response speed and decision efficiency.
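The tiered design above can be sketched as a simple router that sends each event to the pipeline matching its freshness requirement. This is an illustrative stand‑in, not iQIYI's code; the use‑case names and sink labels are assumptions.

```python
# Hypothetical tiered-latency router: use cases that need second-level
# freshness stay on the Kafka + Flink path; everything else goes to the
# minute-level Iceberg lake write. All names here are illustrative.

SECOND_LEVEL_USE_CASES = {"fraud_detection", "ad_bidding"}  # assumed examples

def route(event: dict) -> str:
    """Return the sink for an event based on its use-case tier."""
    if event["use_case"] in SECOND_LEVEL_USE_CASES:
        return "kafka"    # second-level pipeline (Kafka + Flink)
    return "iceberg"      # minute-level pipeline (Flink -> Iceberg)

events = [
    {"id": 1, "use_case": "fraud_detection"},
    {"id": 2, "use_case": "playback_metrics"},
]
print([route(e) for e in events])  # -> ['kafka', 'iceberg']
```

Routing by declared freshness tier, rather than sending everything through the second‑level path, is what lets resources be allocated on demand.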
Kafka vs Iceberg Comparison
| Feature | Kafka | Iceberg |
| --- | --- | --- |
| Latency | Second‑level | Minute‑level (configurable down to 1 min) |
| Storage Cost | High (multiple replicas) | Low (HDFS + compression) |
| Data Governance | Weak (message queue only) | Strong (schema management, version control) |
| Compute Engine Support | Flink/Spark Streaming | Flink/Spark/Trino |
| Applicable Scenarios | Second‑level real‑time (e.g., fraud detection) | Minute‑level analytics, historical backtracking |
Real‑Time Warehouse 2.0 Architecture
The DWD layer tables were migrated to Iceberg, and data processing was refactored into Flink jobs. Migration steps included:
Switch core data (playback business) to the new pipeline.
Abstract offline parsing logic into a unified Pingback SDK, enabling consistent deployment for both real‑time and offline paths.
Run dual pipelines for two months to compare and monitor data consistency.
After validation, perform a seamless cut‑over to the new architecture.
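The dual‑run validation in step 3 amounts to comparing per‑day outputs of the two pipelines and flagging divergence. A minimal sketch, assuming each pipeline exports daily row counts as a dict (the data, function names, and tolerance are all hypothetical):

```python
# Minimal dual-run consistency check: flag days where the old and new
# pipelines' row counts diverge beyond a relative tolerance.
# Input shape ({date: row_count}) and the 0.1% tolerance are assumptions.

def compare_runs(old: dict, new: dict, tolerance: float = 0.001):
    """Yield (day, old_count, new_count) where pipelines diverge."""
    for day in sorted(set(old) | set(new)):
        a, b = old.get(day, 0), new.get(day, 0)
        if a == 0 and b == 0:
            continue  # nothing produced by either pipeline that day
        if abs(a - b) / max(a, b) > tolerance:
            yield day, a, b

old_counts = {"2024-01-01": 1_000_000, "2024-01-02": 1_000_500}
new_counts = {"2024-01-01": 1_000_000, "2024-01-02": 998_000}
print(list(compare_runs(old_counts, new_counts)))
# -> [('2024-01-02', 1000500, 998000)]
```

In practice such a check would also compare key business metrics (e.g., playback duration sums), not just row counts, before the cut‑over is declared safe.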
Post‑migration benefits:
95% storage resource savings and 50% compute resource savings while maintaining data safety.
Writing output to Iceberg simplifies SQL development, since the tables are queried the same way as Hive tables.
Data retention extended from a few hours to a month or longer.
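The retention gain can be pictured as moving from a TTL of hours on Kafka topics to a partition‑retention policy of a month or more on the lake. A toy sketch, with all values assumed for illustration:

```python
# Illustrative retention policy: daily partitions kept for `days_kept`
# days remain queryable, versus Kafka's hours-long TTL. The 30-day
# window and partition scheme are assumptions, not iQIYI's settings.

from datetime import date, timedelta

def retained_partitions(today: date, days_kept: int) -> list:
    """Daily partitions still queryable under a days_kept policy."""
    return [today - timedelta(days=d) for d in range(days_kept)]

window = retained_partitions(date(2024, 6, 30), 30)
print(len(window), window[-1])  # -> 30 2024-06-01
```

With a month of partitions on hand, historical backfills read the lake directly instead of re‑ingesting expired Kafka data.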
Cost Optimization Results
Storage cost reduced by 90% (Iceberg storage costs roughly one‑tenth of Kafka's).
Compute resources reused: the same data serves both Flink streaming and Spark batch, boosting cluster utilization by ~40%.
Long‑term data retention eliminates repeated ETL caused by Kafka data expiration.
Overall effect: millions of yuan saved annually.
Timeliness Improvements
Previously, scenarios with high real‑time requirements (e.g., heartbeat timing) were constrained by cost and suffered minute‑level delays. The upgraded architecture introduces a tiered timeliness design:
Second‑level: Kafka continues to support recommendation, user‑generated content, and ad scenarios.
Minute‑level: Iceberg supports 1‑5 minute latency, covering 90% of real‑time metrics.
Heartbeat timing: Flink + Iceberg incremental processing enables per‑second writes, achieving a ten‑fold speedup in playback duration feedback.
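The incremental‑processing pattern behind the heartbeat path can be simulated in a few lines: each poll consumes only the rows added since the last seen snapshot, which is how a streaming reader over an Iceberg table delivers fresh data without rescanning history. The classes and data below are stand‑ins, not the real Iceberg API.

```python
# Toy simulation of incremental (snapshot-based) consumption: each poll
# processes only rows appended since the last snapshot the consumer saw.
# TinyTable and its methods are illustrative stand-ins for an Iceberg
# table plus a Flink streaming source.

from collections import deque

class TinyTable:
    def __init__(self):
        self.snapshots = deque()  # each snapshot = list of newly written rows

    def append(self, rows):
        """A commit: one new snapshot containing the written rows."""
        self.snapshots.append(rows)

    def incremental_scan(self, from_idx):
        """Return (rows added after snapshot from_idx, new cursor)."""
        rows = [r for snap in list(self.snapshots)[from_idx:] for r in snap]
        return rows, len(self.snapshots)

table = TinyTable()
table.append([{"user": "u1", "heartbeat_s": 30}])
batch, cursor = table.incremental_scan(0)       # first poll: 1 row
table.append([{"user": "u1", "heartbeat_s": 60}])
batch2, cursor = table.incremental_scan(cursor)  # next poll: only the new row
print(len(batch), len(batch2))  # -> 1 1
```

Because each poll touches only new snapshots, downstream metrics such as playback duration can be refreshed continuously instead of waiting for a full batch recompute.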
Business Value Release
Switching playback duration reporting from batch to heartbeat flow enables faster user behavior feedback, supporting real‑time updates of user tags, feature vectors, and recommendation models, and broadening operational flexibility.
Future Plans
Real‑time Warehouse 2.0 will continue rolling out minute‑level data to downstream consumers, further optimize the architecture, expand data sources, and enrich analysis scenarios to provide stronger data support for business growth.
iQIYI Technical Product Team