Data Lake Construction and Practice at NetEase Yanxuan

NetEase Yanxuan moved from a cumbersome data‑warehouse approach to a flexible data lake built on Delta Lake and Iceberg. The lake adds a unified metadata layer and real‑time ingestion pipelines that cut latency from nightly batches to seconds, reduce compute and storage costs, support diverse business scenarios and machine‑learning feature engineering, and prepare the ground for broader expansion.

NetEase Yanxuan Technology Product Team

Mar 30, 2022

NetEase Yanxuan began building its big data system in mid‑2017, and it now supports nearly all business scenarios including commercial analysis, search, recommendation, advertising, supply chain, risk control, product development, and quality control. As reliance on data grew, the original data‑warehouse approach revealed several problems: low data turnover efficiency, high model development and iteration costs, slow production speed that could not keep up with demand, and frequent schema changes that were costly and disruptive. These issues motivated the need for a more efficient, flexible, and near‑real‑time data capability.

To address these challenges, Yanxuan set goals centered on solving the identified problems, improving system efficiency, reducing storage/compute/usage costs, and ensuring stable, large‑scale rollout without impacting existing business. The evaluation of new technologies focused on whether they delivered breakthrough capabilities, improved operational efficiency, lowered costs, and could be deployed reliably.

The article then explores the concept of a data lake, contrasting it with a traditional data warehouse. A data lake prioritizes flexibility: data can be stored in structured, semi‑structured, or unstructured forms without a predefined schema, and compute engines can read/write the lake according to scenario needs, preserving all original information and enabling more efficient, exploratory analysis. In contrast, a data warehouse emphasizes standardized management with pre‑defined schemas and modeling before data is accessed. The authors note that the two approaches are not mutually exclusive and can be combined based on specific use‑cases.
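
The contrast between the two models can be sketched in a few lines of Python. This is an illustrative toy, not Yanxuan's implementation: the event fields, `WAREHOUSE_COLUMNS`, and both function names are assumptions made up for the example. The warehouse-style loader applies a fixed schema when data is written, while the lake-style reader keeps the raw records and picks a projection at query time.

```python
import json

# Hypothetical raw events, one JSON document per line, as they might land in a lake.
# The fields differ from record to record.
RAW_EVENTS = [
    '{"user_id": 1, "action": "click", "item_id": 42}',
    '{"user_id": 2, "action": "search", "query": "pillow", "device": "ios"}',
]

# Hypothetical warehouse model: the columns are decided before any data is loaded.
WAREHOUSE_COLUMNS = ("user_id", "action")


def load_schema_on_write(lines, columns=WAREHOUSE_COLUMNS):
    """Warehouse style: keep only the modeled columns; other fields are dropped."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({column: record.get(column) for column in columns})
    return rows


def read_schema_on_read(lines, fields):
    """Lake style: raw lines stay intact and each query chooses its own fields."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({field: record.get(field) for field in fields})
    return rows


warehouse_rows = load_schema_on_write(RAW_EVENTS)
exploration_rows = read_schema_on_read(RAW_EVENTS, ["user_id", "query"])
```

In this sketch the `query` field is still available for exploration because the raw lines were never reduced to the modeled columns, which is the information-retention property described above.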

Key advantages of adopting a data lake for Yanxuan include: significantly improved data development efficiency (avoiding the need to build warehouse models for ad‑hoc exploration), prevention of information loss (retaining raw details that might be discarded in warehouse modeling), and the ability to provide reliable near‑real‑time data access through technologies such as Delta Lake, Iceberg, and Hudi, which supply ACID transactions and better real‑time performance.
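
To make the ACID point concrete, the following is a minimal in-memory sketch of the general idea behind log-based table formats: data files are staged first, and a single append to a commit log makes them visible, so a reader sees either the old version or the new one and never a partial write. The `VersionedTable` class and its methods are hypothetical and are not the Delta Lake, Iceberg, or Hudi API; concurrency control between writers is left out.

```python
class VersionedTable:
    """Toy table whose visible contents are defined by an append-only commit log."""

    def __init__(self):
        self._files = {}   # file name -> rows (stand-in for data files on storage)
        self._log = [()]   # version number -> tuple of file names visible at that version

    def write_file(self, name, rows):
        """Stage a data file. Readers cannot see it until it is committed."""
        self._files[name] = list(rows)

    def commit(self, add=(), remove=()):
        """Publish a new version in one step by appending a single log entry."""
        add = tuple(add)
        remove = set(remove)
        for name in add:
            if name not in self._files:
                raise KeyError(f"file {name!r} was never staged")
        kept = tuple(name for name in self._log[-1] if name not in remove)
        self._log.append(kept + add)
        return len(self._log) - 1

    def latest_version(self):
        return len(self._log) - 1

    def read(self, version=None):
        """Read a consistent snapshot: the latest version, or an older one by number."""
        if version is None:
            version = self.latest_version()
        rows = []
        for name in self._log[version]:
            rows.extend(self._files[name])
        return rows


table = VersionedTable()
table.write_file("part-0", [{"id": 1, "status": "created"}])
staged_view = table.read()                  # the staged file is not visible yet
v1 = table.commit(add=["part-0"])
table.write_file("part-1", [{"id": 1, "status": "paid"}])
v2 = table.commit(add=["part-1"], remove=["part-0"])
```

A downstream job that started reading at version `v1` keeps seeing the rows of `v1` while the writer publishes `v2`, which is the property that lets near-real-time writes coexist with readers.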

In practice, Yanxuan selected Delta Lake as its initial storage format (later adding Iceberg support) because it met the requirement for row‑level deletes needed in their data‑integration scenario. Data is ingested via Flume for logs and Canal for Binlog into Kafka, then processed by a DataHub‑Hound (Kafka2Hive) task to land raw data, followed by merge jobs that produce ODS‑layer snapshots. A unified metadata abstraction layer manages storage format definitions and queries, enabling multiple platforms to interact with the lake. Compute engines (Flink, Spark, Presto) are integrated through this layer, and stability is ensured by metadata, lineage, and monitoring services.
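
The merge step that turns change events into an ODS snapshot can be illustrated as follows. This is a simplified sketch under stated assumptions: the event shape (`op` plus `row`) and the `apply_changes` function are hypothetical and do not reproduce Canal's wire format or the actual merge job. It shows why row-level deletes matter: a `DELETE` from the binlog has to remove the row from the snapshot rather than leave a stale copy.

```python
def apply_changes(snapshot, events, key="id"):
    """Apply ordered change events to a snapshot and return the new snapshot.

    snapshot: dict mapping primary key -> row (the previous ODS state)
    events:   list of {"op": "INSERT" | "UPDATE" | "DELETE", "row": {...}} in commit order
    """
    merged = dict(snapshot)  # copy so the previous snapshot stays readable
    for event in events:
        row = event["row"]
        op = event["op"]
        if op == "DELETE":
            merged.pop(row[key], None)   # row-level delete
        elif op in ("INSERT", "UPDATE"):
            merged[row[key]] = row       # upsert: the last write wins
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return merged


previous = {
    1: {"id": 1, "status": "created"},
    2: {"id": 2, "status": "created"},
}
changes = [
    {"op": "UPDATE", "row": {"id": 1, "status": "paid"}},
    {"op": "DELETE", "row": {"id": 2}},
    {"op": "INSERT", "row": {"id": 3, "status": "created"}},
]
current = apply_changes(previous, changes)
```

Events must be applied in commit order per key; otherwise an older update could overwrite a newer one.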

Data integration evolved through three versions. V1 performed full nightly loads. V2 introduced incremental merges, reducing latency to about an hour. V3 leveraged Delta Lake and Iceberg to achieve real‑time ingestion with an average latency of around 1 second (minute‑level for large tables), saving roughly 70% of compute and storage resources and eliminating downstream failures thanks to ACID support.
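
A toy cost model helps show why moving away from full loads saves resources. The two functions below are hypothetical and only count rows processed; real engines work on files and partitions, so the numbers are illustrative and are not meant to reproduce the 70% figure.

```python
def full_load(source_rows):
    """Full-load style: rebuild the whole snapshot, so cost grows with table size."""
    snapshot = {row["id"]: row for row in source_rows}
    return snapshot, len(source_rows)


def incremental_merge(snapshot, changed_rows):
    """Incremental style: update the snapshot in place, so cost grows with the change volume."""
    for row in changed_rows:
        snapshot[row["id"]] = row
    return len(changed_rows)


# Hypothetical table of 1,000 rows in which only two rows change between runs.
source = [{"id": i, "qty": 0} for i in range(1000)]
snapshot, full_cost = full_load(source)
delta_cost = incremental_merge(
    snapshot,
    [{"id": 3, "qty": 9}, {"id": 1000, "qty": 1}],
)
```

Here the full load processes every row on each run, while the incremental merge processes only the two changed rows, and the gap widens as the table grows relative to its change rate.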

For machine‑learning workloads, the data lake powers feature engineering. Flink jobs compute features, write them to Redis for online inference, and append the same processed features to an Iceberg‑based feature table. Because both stores receive identical values, offline training and online serving stay consistent, and model updates can move closer to real time.
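
The dual-write pattern can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: a dict plays the role of Redis, a list plays the role of the Iceberg feature table, and `compute_features`, `publish`, and `run_pipeline` are hypothetical names rather than Flink operators. The point is that one computation feeds both stores, so the online value always matches the most recent offline row for the same key.

```python
def compute_features(state, event):
    """Update running per-user counters; a stand-in for a keyed streaming aggregation."""
    user = event["user_id"]
    features = dict(state.get(user, {"clicks": 0, "orders": 0}))
    if event["action"] == "click":
        features["clicks"] += 1
    elif event["action"] == "order":
        features["orders"] += 1
    state[user] = features
    return user, features


def publish(user, features, event_time, online_store, offline_table):
    """Write the same feature values to both sinks."""
    # Online store: overwrite, so serving always reads the latest value.
    online_store[f"feat:{user}"] = dict(features)
    # Offline table: append, so training can replay the full history.
    offline_table.append({"user_id": user, "event_time": event_time, **features})


def run_pipeline(events):
    state, online_store, offline_table = {}, {}, []
    for event in events:
        user, features = compute_features(state, event)
        publish(user, features, event["ts"], online_store, offline_table)
    return online_store, offline_table


events = [
    {"user_id": "u1", "action": "click", "ts": 1},
    {"user_id": "u1", "action": "click", "ts": 2},
    {"user_id": "u2", "action": "click", "ts": 3},
    {"user_id": "u1", "action": "order", "ts": 4},
]
online, offline = run_pipeline(events)
```

Appending to the offline table instead of overwriting it keeps every intermediate feature value, which lets training jobs reconstruct what the model would have seen at a given event time.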

Looking forward, Yanxuan plans to extend lake capabilities to more critical services such as search, recommendation, and risk control, and to further optimize the underlying compute and storage engines to support increasingly complex scenarios.



