Practical Experience of In‑Lake Warehouse Implementation Based on Lakehouse Architecture
This article presents a comprehensive overview of Lakehouse‑based in‑lake warehousing, covering common data‑lake misconceptions, the evolution from databases to data warehouses and lakes, the advantages of Lakehouse over traditional architectures, a reference multi‑layer architecture, typical use cases, challenges, future plans, and a brief Q&A.
Introduction
The talk, “Practical Experience of In‑Lake Warehouse Implementation Based on Lakehouse Architecture,” is organized into the following parts: background and industry status, the evolution of data platforms, real‑time computing architectures, a Lakehouse reference architecture, typical in‑lake warehouse scenarios, future plans and challenges, and a closing Q&A session.
1. Background and Industry Status
Common misconceptions about data lakes include believing they are only for massive storage, only for unstructured data, or only suited to a raw‑data layer. In reality, data lakes also support batch, real‑time, and interactive analytics as well as machine learning; restricting them to storage alone leaves these capabilities idle and drives up overall costs.
Typical data‑lake usage patterns—raw storage only, raw storage plus batch processing, and multi‑cluster construction—often fail to fully exploit lake capabilities and lead to resource waste, data redundancy, and high operational costs.
2. Evolution of Data Platforms
The evolution is described in four stages: (1) Databases, (2) Data warehouses (OLAP‑oriented), (3) Data lakes (large‑scale storage and compute, adding real‑time and ML), and (4) Lakehouse, which adds transactional, update, and real‑time capabilities to the lake, enabling a unified platform for diverse workloads.
3. Real‑Time Computing Architectures
Three real‑time architectures are compared: Lambda architecture (separate batch and stream layers), OLAP‑based real‑time architecture (enhanced Lambda), and Lakehouse‑based stream‑batch unified architecture, which leverages components like Hudi for incremental processing and allows the same code to run in both batch and streaming modes.
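The stream‑batch unification idea can be sketched in plain Python. This is an illustrative toy, not Hudi's or Flink's actual API: one shared transformation function serves both a full‑log batch run and an incremental run that carries state across micro‑batches, and both paths produce the same result.

```python
def latest_per_key(records):
    """Shared business logic used by both the batch and streaming paths:
    keep the most recent (by timestamp) value for each key."""
    state = {}
    for key, ts, value in records:
        if key not in state or ts >= state[key][0]:
            state[key] = (ts, value)
    return state


def run_batch(log):
    """Batch mode: process the full change log in one pass."""
    return latest_per_key(log)


def run_streaming(log, chunk_size=2):
    """Streaming mode: feed the same logic incremental micro-batches,
    replaying the carried-forward state together with each increment."""
    state = {}
    for i in range(0, len(log), chunk_size):
        increment = [(k, ts, v) for k, (ts, v) in state.items()]
        increment += log[i:i + chunk_size]
        state = latest_per_key(increment)
    return state
```

Because `latest_per_key` is the single source of business logic, switching a pipeline from batch to streaming (or running both against the same lake tables) requires no code duplication, which is the core promise of the stream‑batch unified architecture.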
4. Lakehouse Reference Architecture
The reference architecture consists of six layers:
Unified compute cluster layer – supports batch, streaming, interactive queries, and ML in a single cluster, avoiding multi‑cluster overhead.
Unified metadata and permission management layer – provides consistent metadata and access control across the platform.
Data integration layer – handles data ingestion and export with optimized commercial or open‑source tools.
Lakehouse layer – builds warehouse layers (raw, detail, summary) on the lake, offering ACID transactions, upserts, and performance features such as indexing and materialized views.
Unified storage layer – uses object or distributed block storage for massive data.
Data mart layer – offers diverse query engines (e.g., Doris, ClickHouse, HBase, Redis, IoTDB) to serve specific business scenarios.
This architecture reduces resource redundancy, lowers development and operational costs, and improves data processing efficiency.
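The upsert semantics that the Lakehouse layer adds on top of the lake can be illustrated with a toy copy‑on‑write merge (a sketch only; real table formats such as Hudi or Iceberg implement this with file‑level rewrites and transaction logs, not in‑memory dicts): incoming rows replace or extend existing rows that share the same primary key, and the table is rewritten atomically.

```python
def upsert(table, updates, key="id"):
    """Minimal copy-on-write upsert sketch: produce a new version of a
    keyed table in which incoming rows replace (or partially update)
    existing rows with the same primary key."""
    merged = {row[key]: row for row in table}
    for row in updates:
        # Merge column-wise so partial updates keep untouched columns.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())
```

In a real Lakehouse table the rewrite is committed as a single transaction, which is what gives readers a consistent snapshot while writers apply updates.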
5. Typical In‑Lake Warehouse Scenarios
Examples include real‑time lake scenarios (stream‑batch unified processing), data‑layer sharing to improve reuse and reduce cost, gradual migration of batch workloads to real‑time, and building mirror tables or slowly changing dimension (SCD) tables with upsert capabilities. Both batch and real‑time pipelines share the same data, simplifying development.
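Building SCD tables on upsert capability can be sketched as follows; this is a minimal Type 2 illustration under assumed conventions (string dates, `start`/`end` validity columns, `end is None` marking the open version), not the talk's actual implementation:

```python
def scd2_upsert(dim, incoming, key, effective):
    """Hedged SCD Type 2 sketch: when an incoming row changes a tracked
    attribute, close the currently open version (set its end date) and
    append a new open version effective from the given date."""
    for row in incoming:
        current = next((r for r in dim
                        if r[key] == row[key] and r["end"] is None), None)
        if current is None:
            # New key: insert the first open version.
            dim.append({**row, "start": effective, "end": None})
        elif any(current[c] != row[c] for c in row if c != key):
            # Attribute changed: close the old version, open a new one.
            current["end"] = effective
            dim.append({**row, "start": effective, "end": None})
    return dim
```

With lake‑level upserts, both the "close old row" update and the "append new row" insert land in the same table transactionally, which is what makes SCD and mirror tables practical directly on the lake.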
6. Future Planning and Challenges
Challenges identified are high‑concurrency writes, complex cross‑table transactions, and the steep learning curve of Lakehouse features. Future plans focus on “out‑of‑the‑box” usability—allowing business SQL to run directly after deployment—and on scenario‑specific performance, especially for interactive queries.
7. Q&A
Q1: Which open‑source Lakehouse projects are available? A: The main ones are Hudi, Iceberg, and Delta Lake.
Q2: How can structured and unstructured data share unified storage? A: Both sit on object storage behind a common metadata layer, though metadata for unstructured files is limited to file‑system information.
Q3: How does real‑time lake‑layer processing differ from offline processing? A: Only in latency requirements; layers can be skipped flexibly to shorten end‑to‑end latency.
Thank you for attending the session.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.