Data Lake: Concepts, Architecture, and Application in iQIYI's Data Platform
iQIYI's data-middle-platform team built a four-zone data lake (raw, product, work, and sensitive) integrated with unified ODS/DWD/MID warehouse layers, a metadata catalog, and self-service tools. The stack builds on HDFS, Hive/Iceberg, Spark/Trino, and Flink; the team migrated core tables to Apache Iceberg for minute-level data freshness and now aims to further streamline platform modules and adopt new technologies.
As the data middle‑platform team of iQIYI, we are responsible for managing massive data assets and continuously adopting new concepts and tools to refine our data governance. The "data lake" has become a widely discussed concept in recent years, offering a new perspective for governing, integrating, and processing data.
Data lakes aim to provide an efficient storage and management solution that enhances data usability and availability. Their value lies in two main aspects: (1) the ability to store all data comprehensively, regardless of current usage, ensuring easy retrieval when needed; (2) organized, scientifically managed data that enables self‑service access, reducing the reliance on data engineers.
To manage different data types, we divide the lake into four core zones:
Raw Zone: stores raw, unprocessed data for data engineers and data scientists; access may be limited.
Product Zone: contains data processed and standardized by engineers, scientists, and analysts, used for reporting, analysis, and machine learning.
Work Zone: holds intermediate data generated by data workers and managed by the users themselves for flexible exploration.
Sensitive Zone: stores highly sensitive data (PII, financial, compliance) with strict access controls.
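The four zones imply different access policies: broad but curated access in the product zone, restricted access in the raw zone, self-managed access in the work zone, and explicit grants only for the sensitive zone. A minimal sketch of how such zone-level policies might be expressed and checked — the zone names come from the list above, while the roles and policy table are purely illustrative:

```python
# Hypothetical zone-level read policies; roles and grants are illustrative,
# only the four zone names come from the article.
ZONE_POLICIES = {
    "raw":       {"data_engineer", "data_scientist"},            # limited access
    "product":   {"data_engineer", "data_scientist", "analyst"}, # curated, broad
    "work":      {"owner"},                                      # self-managed
    "sensitive": set(),                                          # explicit grants only
}

def can_access(zone: str, role: str, explicit_grant: bool = False) -> bool:
    """Return True if a role may read a zone under this sketch policy."""
    if explicit_grant:  # e.g. an approved self-service permission request
        return True
    return role in ZONE_POLICIES.get(zone, set())
```

In this sketch, `can_access("raw", "analyst")` is denied while an approved permission request (`explicit_grant=True`) opens even the sensitive zone, mirroring the self-service permission-request flow described below.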
Applying the data‑lake philosophy to our data middle platform, we unified the data warehouse layers (ODS, DWD, MID) and built a metadata center, data asset center, and data catalog to support self‑service queries and permission requests. We also provided a self‑service analysis platform and upgraded the overall architecture.
Technical stack:
Underlying data layer: various sources such as Pingback logs, relational databases, and NoSQL stores.
Storage layer: primarily HDFS for raw files; Hive, Iceberg, or HBase for structured/unstructured data.
Compute layer: offline engines (Pilot driving Spark or Trino), scheduling engine Gear, real‑time engine RCP (now Flink).
Development layer: toolkits for building offline/real‑time workflows, data integration, and machine‑learning pipelines; data‑lake platform manages file/table metadata, while the warehouse platform manages data models, dimensions, and metrics.
Data‑lake technologies include Delta Lake, Hudi, and Iceberg. After evaluation, we selected Apache Iceberg as the table format because it supports efficient row‑level updates and minute‑level data freshness, outperforming traditional Hive tables.
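The row-level update capability is the key difference: with plain Hive tables, changing a few rows means rewriting an entire partition, whereas Iceberg can apply an upsert by key (e.g. via a `MERGE INTO`). A minimal in-memory model of those merge semantics — this is not the Iceberg API, just a sketch of the behavior, with hypothetical field names:

```python
def merge_upsert(table: dict, changes: list, key: str = "id") -> dict:
    """Model of MERGE-INTO semantics: rows whose key matches an existing
    row are updated; unmatched rows are inserted. `table` maps key -> row."""
    merged = dict(table)
    for row in changes:
        merged[row[key]] = row  # matched -> update, unmatched -> insert
    return merged

# Existing table state, keyed by id (field names are illustrative)
tbl = {1: {"id": 1, "qos": 0.97}, 2: {"id": 2, "qos": 0.88}}
# A minute-level batch of incoming changes: one update, one insert
tbl = merge_upsert(tbl, [{"id": 2, "qos": 0.91}, {"id": 3, "qos": 0.95}])
```

Because only the changed rows are touched, this style of update is what makes minute-level freshness practical compared to rewriting Hive partitions.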
We migrated ODS and DWD tables to Iceberg and refactored the processing logic into Flink jobs. The migration was performed in phases: we started with non-core data (e.g., QOS and custom deliveries), abstracted the parsing logic into a unified SDK, ran the old and new pipelines in parallel for two months to verify consistency, and finally switched over with no user impact. The results include near-real-time data (5-minute latency) for QOS and custom deliveries, elimination of duplicate pipelines, and resource savings.
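The two-month parallel run boils down to producing the same output from both pipelines and diffing the results before cutover. A hypothetical sketch of such a consistency check (the function and field names are ours, not iQIYI's tooling):

```python
def pipelines_consistent(old_rows: list, new_rows: list, key: str = "id"):
    """Compare the legacy pipeline's output with the Flink/Iceberg pipeline's.
    Returns (ok, diffs): ok is True when every key agrees; diffs lists the
    keys that are missing from one side or whose rows differ."""
    old_by_key = {r[key]: r for r in old_rows}
    new_by_key = {r[key]: r for r in new_rows}
    diffs = [k for k in old_by_key.keys() | new_by_key.keys()
             if old_by_key.get(k) != new_by_key.get(k)]
    return (not diffs, sorted(diffs))
```

Running a check like this on each dual-write window surfaces divergences early, so the final switchover is a routing change rather than a leap of faith.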
Future plans focus on further refining modules to make the data platform more comprehensive and user‑friendly, continuing stream‑batch integration, and adopting new technologies to improve data production efficiency and reduce costs.
iQIYI Technical Product Team