
Data Lake Concepts, Benefits, and Iceberg‑Based Implementations at iQIYI

iQIYI’s data lake combines public‑cloud and private storage with Apache Iceberg’s snapshot‑based table format to enable near‑real‑time, unified batch‑and‑stream analytics. It reduces cost, simplifies architecture, and improves data freshness across use cases such as log collection, auditing, pingback, and member order processing.

iQIYI Technical Product Team

The concept of a data lake was first introduced in 2010. Over the years it has evolved into two main definitions: public‑cloud data lakes and non‑public‑cloud data lakes.

Public‑Cloud Data Lake

Public‑cloud providers such as AWS, Google Cloud, Alibaba Cloud, and Tencent Cloud describe a data lake as a centralized, virtually unlimited storage area that can hold structured, semi‑structured, and unstructured data. In practice, a public‑cloud data lake is simply the provider’s object‑storage service (e.g., AWS S3, Google Cloud Storage, OSS).

Before cloud storage, enterprise data lived in isolated business databases and only processed structured data due to limited storage capacity. The emergence of Hadoop and cloud object storage solved this problem by allowing all data types to be ingested into a single repository for downstream processing, giving rise to the modern data lake.

Non‑Public‑Cloud Data Lake

Hadoop‑based storage and public‑cloud object storage support only file‑level operations (upload, delete) and cannot modify individual rows. Consequently, data warehouses built on these layers cannot provide real‑time incremental updates or streaming ingestion; latency is typically measured in hours, or T+1 (next day).

To address this limitation, Uber, Netflix, and Databricks open‑sourced Hudi, Iceberg, and Delta Lake, respectively, between 2017 and 2019. These projects add a generic table‑format layer on top of Hadoop or object storage, and in non‑public‑cloud scenarios the industry often refers to them collectively as “data lakes.”

iQIYI Data Lake

Based on the two definitions above, iQIYI’s data lake is expected to have the following characteristics:

Unified storage: supports flexible storage back‑ends (public‑cloud, private‑cloud, HDD/SSD/cache) with sufficient capacity.

Common data abstraction/organization layer: supports structured, semi‑structured, and unstructured data (current table formats only unify structured data).

Support for batch processing, stream processing, and machine‑learning workloads.

Unified data management (metadata, lifecycle, governance) to avoid data silos.

We call this the “broad” data lake, while the “narrow” data lake specifically refers to Hudi, Iceberg, or Delta Lake tables.

Why a Data Lake Is Needed

Three typical scenarios illustrate the business need for a data lake:

Scenario 1 – Real‑Time Event‑Stream Analysis

For large‑scale event‑stream workloads such as log and event reporting, a lake‑based pipeline offers:

Good timeliness: near‑real‑time visibility (1‑5 minutes) compared with Hive’s offline latency.

Large scale: HDFS scales horizontally to very high write throughput.

Low cost: no dedicated servers, low hardware and O&M cost.

Query support: both detail‑level queries and complex joins.

Data sharing: multiple engines (Spark, Trino, Flink) can read the same lake.

Scenario 2 – Change‑Data Analysis

Row‑level changes (e.g., MySQL order updates) are hard to handle with traditional warehouses. A data lake enables near‑real‑time ingestion of change events and incremental updates without full partition rewrites.
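The mechanics can be sketched in a few lines. The following is a conceptual model only, not Iceberg’s actual implementation: it shows how row‑level change events (insert/update/delete) can be folded into a table keyed by primary key without rewriting whole partitions. Event shapes and field names here are illustrative assumptions.

```python
def apply_cdc_events(table, events):
    """Apply insert/update/delete change events to an in-memory 'table'
    keyed by primary key, mimicking merge-on-read semantics."""
    for event in events:
        op, key, row = event["op"], event["key"], event.get("row")
        if op in ("insert", "update"):
            table[key] = row          # upsert: the newest version wins
        elif op == "delete":
            table.pop(key, None)      # tombstone the row
    return table

# Example: a MySQL-style order table receiving CDC events.
orders = {1: {"status": "created"}}
events = [
    {"op": "update", "key": 1, "row": {"status": "paid"}},
    {"op": "insert", "key": 2, "row": {"status": "created"}},
    {"op": "delete", "key": 2},
]
orders = apply_cdc_events(orders, events)  # → {1: {"status": "paid"}}
```

In a real Iceberg V2 table the deletes and updates land as delete files that readers merge at query time, rather than mutating data in place.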

Scenario 3 – Stream‑Batch Integration

Traditional Lambda architectures require separate batch and real‑time pipelines, leading to high development and maintenance cost and data inconsistency. A data lake can provide a unified stream‑batch solution.
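The core of the stream‑batch idea is that one transformation function serves both a full batch scan and an incremental streaming read, so there is no duplicate code base to keep in sync. A minimal sketch, with function and field names chosen for illustration:

```python
def enrich(record):
    # Shared business logic used by both the batch and streaming paths.
    return {**record, "valid": record["bytes"] > 0}

def batch_job(full_scan):
    # Batch path: full table scan over all historical records.
    return [enrich(r) for r in full_scan]

def streaming_job(incremental_reader):
    # Streaming path: consume only newly committed records.
    for r in incremental_reader:
        yield enrich(r)

rows = [{"bytes": 10}, {"bytes": 0}]
batch_out = batch_job(rows)
stream_out = list(streaming_job(iter(rows)))
# Both paths produce identical results from identical input.
```

With an Iceberg table underneath, the batch path reads a full snapshot while the streaming path reads snapshot increments, eliminating the Lambda architecture’s consistency problem by construction.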

Technical Summary – Apache Iceberg

Iceberg is a modern open‑source table format for large‑scale analytics. It is not a storage engine (it works on HDFS, S3, etc.) and not a file format (data files are typically Parquet, ORC, or Avro). Queries can be executed via Spark, Flink, Trino, Hive, etc.

Key design points:

Snapshots: a table points to a snapshot (e.g., S1). Writes create a new snapshot (S2) that is invisible until committed.

Metadata stored at file level, enabling fast planning and file‑level filtering.

Snapshot isolation: reads and writes operate on different snapshots.

Optimistic concurrency for parallel writes.

Efficient small modifications: new files can be added without full partition replacement.

Row‑level updates via DeleteFile and Merge‑On‑Read (Iceberg V2).
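The snapshot and concurrency points above can be modeled in a short teaching sketch. This is not Iceberg’s real metadata layout; it only illustrates why readers pinned to a snapshot never see in‑flight writes, and why an optimistic commit fails when another writer advanced the table first.

```python
class Table:
    """Toy model of snapshot-based table state with optimistic commits."""

    def __init__(self):
        self.snapshots = [frozenset()]   # S0: the empty table
        self.current = 0                 # pointer to the current snapshot

    def read(self):
        # Readers pin a snapshot; later commits never change this view.
        return self.snapshots[self.current]

    def commit(self, base_id, new_files):
        # Optimistic concurrency: succeed only if no other writer
        # advanced the table since this writer read its base snapshot.
        if base_id != self.current:
            raise RuntimeError("conflict: retry against the new snapshot")
        self.snapshots.append(self.snapshots[base_id] | set(new_files))
        self.current += 1

t = Table()
base = t.current
reader_view = t.read()                  # pinned at S0 (empty)
t.commit(base, {"data-001.parquet"})    # creates S1; reader_view unchanged
```

A second writer that started from S0 would now fail its commit and retry against S1, which is how parallel writes stay safe without locks.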

Case Studies at iQIYI

Venus Log Collection Platform

Original architecture stored logs in Elasticsearch, which suffered high cost, write failures, and low scalability. After migrating to Iceberg on HDFS, Venus achieved:

Cost reduction: shared HDFS storage and a single Trino cluster.

Write stability: HDFS replication eliminates single‑node failures.

Operational overhead cut by ~80%.

Audit Data

Previously a mix of MongoDB, Elasticsearch, Hive, and MySQL caused high development cost and poor query performance. Replacing the stack with Iceberg tables enabled row‑level updates, fast multi‑column filtering, PB‑scale storage, and near‑real‑time (≈5 minutes) data freshness.

Pingback Stream‑Batch Integration

Pingback previously used a Lambda architecture (Kafka + Flink for real‑time, HDFS + Hive/Spark for batch). By building a near‑real‑time Iceberg pipeline (Kafka → Flink → Iceberg ODS → DWD), iQIYI achieved:

Latency reduced to <5 minutes while supporting both incremental and full scans.

Elimination of duplicate code bases and data inconsistency.

Cost savings of ~90% for the real‑time path.

Member Order

Order data originally lived in MySQL and was exported to Hive or written to Kudu via CDC. After adopting Iceberg V2 with Flink CDC ingestion:

Latency as low as 1 minute.

Query performance comparable to Kudu/Impala.

Significant cost reduction (no dedicated OLAP cluster).

Reduced operational pressure on MySQL.

Summary and Roadmap

Data lake technology, especially Apache Iceberg, is rapidly maturing. iQIYI’s deployments have demonstrated substantial business value: lower cost, higher data freshness, simplified architecture, and improved data quality. Future work includes extending real‑time attribution for advertising, accelerating feature production for machine learning, and further promoting stream‑batch integration across more services.

Tags: Big Data · data lake · Apache Iceberg · data architecture · Real-time Analytics · Stream-Batch Integration