
Design Principles, Architecture, and Applications of the Open‑Source LakeSoul Lakehouse Framework

This article provides a comprehensive technical overview of LakeSoul, an open‑source, cloud‑native lakehouse framework, covering its design philosophy, core features, architecture, performance benchmarks, real‑time ingestion, incremental computation, multi‑stream joining, security, community progress, and future roadmap.

DataFunTalk

Introduction LakeSoul is an open‑source lakehouse framework, developed in China, that integrates data‑lake and data‑warehouse capabilities. It supports real‑time data ingestion, unified streaming and batch processing, and combined BI/AI analysis with strong concurrency and I/O performance.

Design Philosophy and Goals The framework follows four main objectives: lake‑warehouse integration, streaming‑batch convergence, end‑to‑end real‑time processing, and BI/AI unification. It emphasizes ELT, multi‑source data, and flexible data formats.

Timeline LakeSoul originated from real‑time recommendation and advertising workloads and evolved through five stages from its initial open‑source release in 2021 to recent performance enhancements and Linux Foundation incubation.

Overall Architecture The system consists of three layers: LakeSoul Storage Layer (supporting S3, HDFS, MinIO, OSS), LakeSoul Query Engine (compatible with Spark, Flink, Hive), and LakeSoul Distributed Meta Service (metadata, ACID, statistics). Data sources and services (BI/AI) connect to these layers.

Data Format Physical data is stored in Parquet with primary‑key hash sharding and range partitioning; metadata is versioned, enabling snapshot reads and rollbacks.
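The routing idea described above can be sketched as follows. This is a rough illustration only; the bucket and partition naming conventions here are assumptions, not LakeSoul's actual file layout.

```python
import hashlib

def route_record(record: dict, primary_key: str, range_column: str,
                 num_buckets: int) -> str:
    """Route a record to a physical Parquet path by range partition
    and primary-key hash bucket (illustrative sketch)."""
    # Range partitioning: records are grouped by the partition column.
    partition = f"{range_column}={record[range_column]}"
    # Hash sharding: a stable hash of the primary key picks a bucket,
    # so every version of one key lands in the same file group.
    digest = hashlib.md5(str(record[primary_key]).encode()).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return f"{partition}/bucket_{bucket}.parquet"

path = route_record({"user_id": 42, "dt": "2024-01-01"}, "user_id", "dt", 4)
```

Because the hash is stable, updates to the same primary key always land in the same bucket, which is what makes sorted merge-on-read over a key range feasible.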

Metadata Management Metadata tables (partition, commit, table info, etc.) are stored in PostgreSQL, leveraging transactions, two‑phase commit, and triggers for automatic compaction and cleanup.

Two‑Phase Commit and Conflict Resolution Writes use a two‑phase commit protocol; compatible conflicts are retried, while incompatible conflicts cause failure.
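A toy version of the compatible-versus-incompatible decision might look like this. The rule set below is a simplified assumption for illustration; LakeSoul's actual conflict rules are enforced in its PostgreSQL meta service.

```python
def resolve(concurrent_ops, incoming):
    """Decide the fate of a commit given operations that landed
    since our snapshot was taken (simplified sketch)."""
    if not concurrent_ops:
        return "commit"    # no conflict: the second phase succeeds
    if incoming == "append" and all(op == "append" for op in concurrent_ops):
        return "retry"     # compatible: rebase on the new snapshot and retry
    return "fail"          # incompatible: abort the write
```

Concurrent appends touch disjoint files, so they can be retried on top of each other; an update racing another write cannot, so it fails.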

Snapshot Management Automatic schema evolution and snapshot handling allow seamless reads of evolving data structures.
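The schema-evolution behavior can be pictured as a read-time projection onto the merged schema. This is a simplified illustration, not LakeSoul's implementation.

```python
def merged_schema(versions):
    """Union of column names across schema versions, oldest first."""
    schema = []
    for cols in versions:
        for c in cols:
            if c not in schema:
                schema.append(c)
    return schema

def read_with_schema(row: dict, schema):
    """Project an old row onto the evolved schema; columns that did
    not exist when the row was written read as None."""
    return {c: row.get(c) for c in schema}

schema = merged_schema([["id", "name"], ["id", "name", "email"]])
row = read_with_schema({"id": 1, "name": "a"}, schema)
```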

Native IO Design and Performance Implemented in Rust using Apache Arrow and DataFusion, Native IO provides asynchronous, high‑performance Parquet read/write, multi‑language bindings (C, Java, Python), and outperforms traditional Parquet‑mr in Spark benchmarks.
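Native IO itself is written in Rust, but the core pipelining idea, keeping several row-group fetches in flight while earlier groups are consumed, can be sketched in Python with asyncio. All names below are illustrative.

```python
import asyncio

async def fetch_row_group(i):
    # Stand-in for an async object-store range read of one row group.
    await asyncio.sleep(0.01)
    return f"rowgroup-{i}"

async def read_file(n_groups, prefetch=2):
    """Keep `prefetch` fetches in flight while earlier row groups
    are decoded, overlapping network latency with CPU work."""
    pending = [asyncio.create_task(fetch_row_group(i))
               for i in range(min(prefetch, n_groups))]
    results = []
    next_i = len(pending)
    while pending:
        group = await pending.pop(0)
        results.append(group)          # decode would happen here
        if next_i < n_groups:
            pending.append(asyncio.create_task(fetch_row_group(next_i)))
            next_i += 1
    return results

groups = asyncio.run(read_file(5))
```

On high-latency object stores such as S3, this overlap is where most of the throughput gain over synchronous readers comes from.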

Core Features

Real‑time data ingestion from databases, Kafka, CDC tools (Flink CDC, Debezium) with exactly‑once guarantees.
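One common way to obtain exactly-once results from an at-least-once CDC feed is idempotent keyed upserts. The event shape below is an assumption for illustration, not Debezium's or Flink CDC's actual wire format.

```python
def apply_cdc(table: dict, events):
    """Apply a CDC changelog of (op, key, row) to a keyed table.
    Upserting by primary key makes replays idempotent, so redelivered
    events do not change the final state."""
    for op, key, row in events:
        if op in ("insert", "update"):
            table[key] = row
        elif op == "delete":
            table.pop(key, None)
    return table

events = [("insert", 1, {"v": "a"}), ("update", 1, {"v": "b"}), ("delete", 2, None)]
state = apply_cdc({2: {"v": "x"}}, events)
state = apply_cdc(state, events)   # replay leaves the table unchanged
```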

Incremental computation supporting Flink changelog streams and full‑batch merges.
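A Flink changelog stream carries retractions as well as insertions. A tiny sketch of maintaining an aggregate incrementally over such a stream, using Flink's RowKind markers (+I insert, -U retract old value, +U new value, -D delete):

```python
def incremental_sum(changes):
    """Maintain a running SUM over a changelog stream instead of
    recomputing from the full batch each time."""
    total = 0
    for kind, value in changes:
        if kind in ("+I", "+U"):
            total += value
        elif kind in ("-U", "-D"):
            total -= value
    return total

# An update of 5 -> 7 arrives as a retraction (-U) plus a new value (+U).
total = incremental_sum([("+I", 10), ("+I", 5), ("-U", 5), ("+U", 7)])
```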

Multi‑stream joining that eliminates costly joins by merging heterogeneous streams on common keys.
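The idea can be sketched as a keyed dictionary merge: each stream writes only its own columns, and the full row is assembled by key at read time instead of through an explicit join. This is a simplified model, not the storage layer's actual merge logic.

```python
def merge_streams(streams, primary_key):
    """Assemble full rows from heterogeneous streams that share a
    primary key, replacing an explicit join with a read-time merge."""
    rows = {}
    for stream in streams:
        for rec in stream:
            rows.setdefault(rec[primary_key], {}).update(rec)
    return rows

clicks = [{"uid": 1, "clicks": 3}]
profile = [{"uid": 1, "age": 30}, {"uid": 2, "age": 41}]
merged = merge_streams([clicks, profile], "uid")
```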

RBAC permissions using PostgreSQL and Casbin, and data lineage via OpenLineage.
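As a sketch of the RBAC model only: the policy tables and names below are hypothetical, and a real deployment would store such rules in PostgreSQL and evaluate them with the Casbin library rather than plain dictionaries.

```python
# Hypothetical policy in the spirit of a Casbin RBAC model:
# roles grant (resource, action) pairs, users hold roles.
ROLE_GRANTS = {
    "analyst": {("sales_db", "read")},
    "admin": {("sales_db", "read"), ("sales_db", "write")},
}
USER_ROLES = {"alice": ["analyst"], "bob": ["admin"]}

def enforce(user, resource, action):
    """Return True if any of the user's roles grants the action."""
    return any((resource, action) in ROLE_GRANTS.get(role, set())
               for role in USER_ROLES.get(user, []))
```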

Automatic maintenance including global compaction and expired data cleanup triggered by PostgreSQL events.

Performance evaluation showing 1.7‑2.0× read/write speed improvements over Spark on S3 and significant gains over Iceberg and Hudi in merge‑on‑read scenarios.

Application Scenarios

Building real‑time lakehouse pipelines that combine multi‑source ingestion, unified batch/stream analytics, and BI/AI services.

Real‑time machine‑learning sample generation using multi‑stream merging and online model training.

Open‑Source Community and Future Roadmap Since its 2021 release, LakeSoul has been donated to the Linux Foundation and is now a sandbox project. Future plans include built‑in role‑based access control, data quality validation, native Python readers, broader database connectors, Kafka Connect and LogStash sinks, Presto connector, performance enhancements for MOR, vectorized execution, and local caching.

For more details, visit the GitHub repository: https://github.com/lakesoul-io/LakeSoul .

Tags: Big Data, Flink, real-time analytics, open source, Spark, Data Lakehouse, LakeSoul
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
