Lakehouse Analysis Service (LAS): Architecture, Challenges, and Service Design
The article introduces the Lakehouse Analysis Service (LAS), explains its layered architecture that unifies data lake and warehouse capabilities, discusses challenges with Apache Hudi metadata and consistency, and details the design of the unified MetaServer, Table Management Service, concurrency control, async compaction, event bus, and future roadmap.
LAS Introduction – LAS (Lakehouse Analysis Service) combines the advantages of data lakes and warehouses, storing all data in low‑cost storage for ML and analytics while providing a warehouse layer for BI reporting.
Overall Architecture – The stack consists of a lake‑warehouse development tool, a streaming‑batch unified analysis engine (routing SQL to Spark, Presto, or Flink), a unified metadata layer, and a storage layer that separates compute and storage for elastic scaling and cost reduction.
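The routing step above can be sketched as a small dispatcher. This is a minimal, hypothetical illustration of routing SQL to an engine by workload type; the heuristics and engine names here are illustrative stand-ins, not LAS's actual optimizer rules.

```python
# Hypothetical sketch of unified-SQL engine routing. Real systems route via a
# cost-based optimizer; these string heuristics are only for illustration.
def route_query(sql: str, streaming: bool = False) -> str:
    """Pick an execution engine for a unified streaming-batch SQL request."""
    if streaming:
        return "flink"            # continuous / streaming jobs
    text = sql.strip().lower()
    # Short interactive lookups and simple aggregations -> MPP engine.
    if text.startswith("select") and "join" not in text and len(text) < 200:
        return "presto"
    # Large batch ETL (inserts, multi-way joins) -> Spark.
    return "spark"

assert route_query("SELECT count(*) FROM t WHERE dt='2023-01-01'") == "presto"
assert route_query("INSERT OVERWRITE TABLE t SELECT * FROM src") == "spark"
assert route_query("SELECT * FROM clicks", streaming=True) == "flink"
```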
Problems and Challenges – Built on Apache Hudi, LAS faced three issues: metadata isolation (no global view across tables and engines), difficulty guaranteeing snapshot consistency, and compaction latency on the write path. To address these, LAS introduced a unified MetaServer for a reliable global metadata view and an async compaction mechanism that decouples compaction from the commit path.
Metadata Service (MetaServer) Design – MetaServer comprises three modules: Hudi Catalog (client‑side table abstraction), core MetaServer (stateless service handling metadata CRUD), and an Event Bus for propagating metadata changes. It stores schema, partition info, timeline commits, and snapshot data, supporting schema evolution and versioned concurrency control.
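The versioned metadata described above can be sketched as an in-memory store whose updates succeed only against an expected version, which is what enables schema evolution under concurrent writers. All class and field names below are illustrative, not LAS's real API.

```python
import threading
from dataclasses import dataclass, field

# Hypothetical in-memory sketch of a MetaServer table record: each update
# bumps a version, enabling versioned concurrency control on metadata.
@dataclass
class TableMeta:
    schema: dict
    partitions: list = field(default_factory=list)
    timeline: list = field(default_factory=list)   # committed instants
    version: int = 0

class MetaStore:
    def __init__(self):
        self._tables: dict[str, TableMeta] = {}
        self._lock = threading.Lock()

    def create_table(self, name: str, schema: dict) -> None:
        with self._lock:
            self._tables[name] = TableMeta(schema=schema)

    def evolve_schema(self, name: str, new_schema: dict, expected_version: int) -> bool:
        """Compare-and-swap style update: succeeds only if nobody changed it first."""
        with self._lock:
            meta = self._tables[name]
            if meta.version != expected_version:
                return False          # concurrent change detected
            meta.schema = new_schema
            meta.version += 1
            return True

store = MetaStore()
store.create_table("orders", {"id": "bigint"})
assert store.evolve_schema("orders", {"id": "bigint", "amount": "double"}, expected_version=0)
assert not store.evolve_schema("orders", {"id": "string"}, expected_version=0)  # stale version
```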
Service Layer – Divided into Table, Partition, Timeline, and Snapshot services, each handling specific metadata requests. The Table Management Service (TMS) fully manages asynchronous tasks such as Compaction, Clean, and Clustering, listening to MetaServer events and generating action plans.
Concurrency Control – Uses optimistic locking, CAS‑enabled storage, versioned timelines, and configurable conflict‑check granularity (table, partition, file‑group, file) to maximize concurrent writes while ensuring consistency.
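The configurable granularity above can be illustrated as projecting each commit's write set to the chosen level and intersecting: two concurrent commits conflict only if their projected keys overlap. The granularity names follow the text; the path layout is an assumption for illustration.

```python
# Sketch of configurable conflict-check granularity for optimistic writers.
# Paths are assumed to look like "table/partition/file_group/file".
def conflict_keys(files: list[str], granularity: str) -> set[str]:
    """Project a write set to the configured conflict-check level."""
    keys = set()
    for f in files:
        table, partition, file_group, _file_id = f.split("/")
        if granularity == "table":
            keys.add(table)
        elif granularity == "partition":
            keys.add(f"{table}/{partition}")
        elif granularity == "file_group":
            keys.add(f"{table}/{partition}/{file_group}")
        else:  # "file"
            keys.add(f)
    return keys

def conflicts(a: list[str], b: list[str], granularity: str) -> bool:
    return bool(conflict_keys(a, granularity) & conflict_keys(b, granularity))

w1 = ["t1/dt=01/fg1/f1", "t1/dt=01/fg2/f2"]
w2 = ["t1/dt=02/fg3/f3"]
assert conflicts(w1, w2, "table")          # coarse: any write to t1 conflicts
assert not conflicts(w1, w2, "partition")  # finer: different partitions pass
```

Finer granularity admits more concurrent writers at the cost of a more expensive check, which is why making it configurable per workload matters.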
Event Bus – Encapsulates metadata DDL changes as events, allowing downstream components (e.g., Hive Catalog Listener) to synchronize schema updates.
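The propagation pattern above amounts to publish/subscribe: MetaServer publishes DDL change events, and listeners such as a Hive catalog syncer react. A toy sketch, with all names illustrative:

```python
from collections import defaultdict

# Toy event-bus sketch: handlers subscribe by event type; a publish fans the
# payload out to every subscriber. Not LAS's real event schema.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

synced = []
bus = EventBus()
# A listener that mirrors schema changes into an external Hive catalog.
bus.subscribe("ALTER_TABLE", lambda e: synced.append((e["table"], e["new_column"])))
bus.publish("ALTER_TABLE", {"table": "orders", "new_column": "amount"})
assert synced == [("orders", "amount")]
```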
Table Management Service Details – Consists of a Plan Generator (interacts with MetaServer to create action plans) and a Job Manager (schedules Spark/Flink jobs via Yarn/K8s). It operates in a master‑worker architecture to avoid race conditions.
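The master-worker split above can be sketched as a single planner feeding a shared queue of action plans that workers drain. In LAS the workers would submit Spark/Flink jobs to Yarn/K8s; here the submission is a stand-in, and the plan format is assumed.

```python
import queue
import threading

# Master-worker sketch of a Table Management Service. One planner (master)
# turns metadata events into action plans, so no two plans race on a table.
def plan_generator(events):
    for ev in events:
        yield {"table": ev["table"], "action": "compaction", "instant": ev["instant"]}

def run_workers(plans, n_workers: int = 2):
    q, done, lock = queue.Queue(), [], threading.Lock()
    for p in plans:
        q.put(p)

    def worker():
        while True:
            try:
                plan = q.get_nowait()
            except queue.Empty:
                return
            # A real worker would submit a Spark/Flink job to Yarn/K8s here.
            with lock:
                done.append(f"{plan['table']}@{plan['instant']}:{plan['action']}")

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

events = [{"table": "orders", "instant": "001"}, {"table": "users", "instant": "002"}]
result = run_workers(plan_generator(events))
assert sorted(result) == ["orders@001:compaction", "users@002:compaction"]
```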
Future Plans – Focus on accelerating metadata access, data storage, and index queries, including caching layers (e.g., Alluxio) and tighter integration with MetaServer.
Q&A Highlights – Routing decisions use a unified optimizer with engine‑specific runtimes; consistency across engines is achieved via ANSI/Hive semantics; LAS currently supports per‑table ingestion; conflict‑check strategies are configurable; async compaction runs on shared resources without impacting queries.
Overall, LAS provides a cloud‑native, multi‑tenant, high‑availability lakehouse platform that abstracts storage complexities while delivering unified streaming‑batch analytics.
DataFunSummit