
LakeSoul: An Open‑Source Real‑Time Data Lakehouse Framework – Design, Architecture, Benchmarks and Future Roadmap

This article introduces LakeSoul, an open‑source end‑to‑end real‑time lakehouse framework, detailing its design philosophy, key technologies such as ELT, metadata management, upsert and merge‑on‑read capabilities, performance benchmarks, real‑world use cases, and the roadmap for future enhancements.

DataFunSummit

LakeSoul Design Philosophy

LakeSoul is positioned as an end‑to‑end open‑source real‑time lakehouse framework that adopts an ELT model, allowing data to be ingested into the lake first and then processed in layered models, unifying storage, compute, and AI/BI capabilities.

Background

In cloud‑native environments, object storage provides cheap, scalable storage for massive structured, semi‑structured, and unstructured data. Traditional ETL pipelines suffer from multiple processing chains, inconsistent storage, high maintenance costs, and lack of ACID guarantees. LakeSoul addresses these issues with a unified ELT approach.

LakeSoul Positioning

LakeSoul offers cloud‑native lake‑warehouse construction, low‑code data ingestion supporting both real‑time and batch, high‑throughput upsert, ACID and time‑travel capabilities, and integrated AI/BI support (SQL, Pandas, PyTorch).

Overall Architecture

The top layer is a distributed metadata service managing schemas and providing ACID‑based concurrency control, supporting millions of partitions and billions of files. The compute layer integrates engines such as Flink, Spark, Hive, and future Presto support. The storage layer connects to HDFS, S3, MinIO, OSS, using open formats like Parquet and Avro.

Technical Highlights

Metadata Layer

LakeSoul uses PostgreSQL for metadata management, providing primary‑key based tables, transactional concurrency control, snapshot reads, and two‑phase commit for exactly‑once semantics, scaling to billions of metadata entries.
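The key idea of database‑backed metadata is that a new data version becomes visible only when its metadata row commits atomically. The sketch below illustrates this with Python's built‑in sqlite3 standing in for PostgreSQL; the table and column names are illustrative, not LakeSoul's actual schema.

```python
import sqlite3

# Illustrative stand-in for LakeSoul's PostgreSQL-backed metadata:
# a version is published by a single atomic transaction, so readers
# never observe a partially written snapshot.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_commit_info (
        table_id       TEXT,
        partition_desc TEXT,
        version        INTEGER,
        files          TEXT,    -- comma-separated Parquet file paths
        committed      INTEGER DEFAULT 0
    )
""")

def commit_version(table_id, partition, version, files):
    """Atomically publish a new version: either all file references
    become visible, or none do."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "INSERT INTO data_commit_info VALUES (?, ?, ?, ?, 1)",
            (table_id, partition, version, ",".join(files)),
        )

def read_snapshot(table_id, partition):
    """Snapshot read: return only the latest committed version."""
    return conn.execute(
        "SELECT version, files FROM data_commit_info "
        "WHERE table_id=? AND partition_desc=? AND committed=1 "
        "ORDER BY version DESC LIMIT 1",
        (table_id, partition),
    ).fetchone()

commit_version("orders", "2023-01-01", 1, ["part-0.parquet"])
commit_version("orders", "2023-01-01", 2, ["part-0.parquet", "part-1.parquet"])
print(read_snapshot("orders", "2023-01-01"))
```

Concurrent writers contend on ordinary row‑level transactions in the database rather than on files in object storage, which is what lets this design scale to very large partition and file counts.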

Upsert and Merge‑On‑Read (MOR)

Each upsert writes a new file version, hash‑partitioned and sorted by primary key, enabling high‑throughput writes (10⁵+ rows/sec per core). Because files arrive already sorted, merge‑on‑read (MOR) performs an efficient ordered merge at read time, with customizable merge operators for aggregation or null handling.
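Because every upserted file is sorted by primary key, the read path reduces to a k‑way merge that keeps the newest value per key. A minimal sketch (not LakeSoul's actual code, which is implemented in Rust) using Python's standard library:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_on_read(sorted_files):
    """sorted_files: list of lists of (key, value) pairs, each sorted by
    key; the list index doubles as the file version (later = newer)."""
    tagged = [
        [(key, version, value) for key, value in rows]
        for version, rows in enumerate(sorted_files)
    ]
    # k-way merge across all sorted files, ordered by primary key
    merged = heapq.merge(*tagged, key=itemgetter(0))
    result = []
    for key, group in groupby(merged, key=itemgetter(0)):
        # Default merge operator: last-write-wins (newest file version).
        # LakeSoul lets users plug in custom operators here, e.g. sum
        # aggregation or "keep non-null" semantics.
        _, _, value = max(group, key=itemgetter(1))
        result.append((key, value))
    return result

base   = [(1, "a"), (2, "b"), (3, "c")]
upsert = [(2, "B"), (4, "d")]
print(merge_on_read([base, upsert]))
# key 2 is overwritten by the newer file; keys 1, 3, 4 pass through
```

This is why writes stay cheap (append a sorted file, no rewrite of existing data) while reads remain a single linear pass over the merged streams.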

IO Layer

The IO layer, implemented in Rust, provides language‑agnostic read/write APIs (C, Java, Python) and asynchronous acceleration, delivering 3‑4× read and 1.4× write performance improvements over native Spark‑Parquet.

LakeHouse Ecosystem

LakeSoul supports automatic real‑time ingestion from heterogeneous sources (CDC, Kafka, databases), incremental ODS/DWD/DWS modeling, snapshot reads, rollbacks, and downstream integration with engines like Flink, Spark, Pandas, and PyTorch.
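One way to reason about incremental reads between two snapshots: each snapshot version maps to the set of data files visible at that version, and the increment is simply the files added in between. The sketch below is illustrative only and does not reflect LakeSoul's actual API.

```python
def incremental_files(snapshots, from_version, to_version):
    """snapshots: dict {version: set of file paths}. Returns the files
    visible at to_version that were not yet present at from_version,
    i.e. the data a downstream incremental job needs to process."""
    return sorted(snapshots[to_version] - snapshots[from_version])

snapshots = {
    1: {"f1.parquet"},
    2: {"f1.parquet", "f2.parquet"},
    3: {"f1.parquet", "f2.parquet", "f3.parquet"},
}
print(incremental_files(snapshots, 1, 3))  # ['f2.parquet', 'f3.parquet']
```

The same version mapping supports snapshot reads (read exactly the file set of one version) and rollbacks (repoint the table to an earlier version's file set).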

Benchmarks

Using a CCF data‑lake competition dataset (11 files, 10 incremental versions), LakeSoul outperforms Iceberg and Hudi in both copy‑on‑write and merge‑on‑read modes, achieving several‑fold speedups in read and write due to its Rust‑based IO and efficient metadata handling.

Application Cases

LakeSoul enables real‑time large‑wide tables without costly joins, supports multi‑source data synchronization via Flink CDC, and provides low‑code incremental operators (filter, group‑by, join) defined in YAML, guaranteeing exactly‑once semantics.
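To make the low‑code idea concrete, a YAML pipeline definition might look like the following. This schema is entirely hypothetical; the talk confirms that operators such as filter, group‑by, and join are defined in YAML, but not the actual configuration format.

```yaml
# Hypothetical example only: illustrative schema, not LakeSoul's
# actual low-code configuration format.
pipeline:
  source:
    table: ods_orders          # incremental read from an ODS-layer table
    read_mode: incremental
  operators:
    - filter:
        condition: "order_status = 'PAID'"
    - join:
        right_table: dim_users
        on: user_id
    - group_by:
        keys: [user_id]
        aggregates:
          total_amount: sum(amount)
  sink:
    table: dws_user_order_stats
    write_mode: upsert         # exactly-once via transactional commit
```
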

Future Roadmap

Further IO performance optimizations and native compaction integration.

Support for non‑primary‑key merge‑into and expanded ecosystem connectors (Presto, Pandas, etc.).

Release of incremental read operators, materialized views, and incremental update support.

Donation of the project to the Linux Foundation AI & Data open‑source organization to broaden community impact.

Q&A Highlights

LakeSoul's metadata layer uses PostgreSQL rather than metadata files, avoiding the small‑file problems seen with Iceberg/Hudi.

Flink CDC enables one‑click whole‑database synchronization with automatic schema change detection.

Rust‑based IO provides vectorized operations and significant read/write speedups.

Operator‑based MOR simplifies custom merge logic compared to Hudi payloads.

Tags: big data, metadata management, Flink CDC, Data Lakehouse, ELT, LakeSoul, Rust IO
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
