Inside Apache Paimon 1.4: Core Principles and Design of an AI Multimodal Data Lake
Apache Paimon 1.4 redefines itself as an AI multimodal data lake by introducing row tracking, data evolution, Blob and Vector tables, Variant shredding, and Lumina‑BTree global indexing, each explained with concrete examples, configuration flags, and storage layouts that illustrate how the new capabilities enable unified storage and efficient retrieval of diverse data types.
With the 1.4 release, Apache Paimon now brands itself as an "AI multimodal data lake". This article breaks down the six core capabilities that make up its technical stack and shows how they depend on each other.
Multimodal Capability Overview
Row Tracking
Data Evolution
Blob Table
Vector Storage
Variant Shredding
Lumina + BTree Global Index
Row Tracking provides an immutable per‑row identifier _ROW_ID (enabled with row-tracking.enabled=true) that serves as the foundation for all later evolution features.
Data Evolution enables row‑level upserts, partial column appends without rewriting existing data, and MERGE INTO operations (controlled by data-evolution.enabled=true).
Blob Table introduces large‑object storage for images, videos, audio, and model weights via fields blob-field, blob-descriptor-field, and blob-external-storage-path.
Vector Storage adds a first‑class VECTOR<t, n> type (exposed as vector-field and field.<name>.vector-dim) that stores embeddings in separate .vector.lance files.
Variant Shredding provides columnar decomposition of semi‑structured Variant columns together with a clipped‑read writer/reader, allowing unstable JSON‑like event logs to be stored as Parquet columns.
Lumina (DiskANN) replaces previous Lucene/FAISS indexes with a next‑generation ANN index for exabyte‑scale samples, while a BTree global index supports high‑performance scalar lookups and row‑range pruning. The two indexes are combined in a pre‑filter step: BTree filters first, then ANN searches a much smaller vector space.
Detailed Walk‑through of Each Capability
Blob Table
When a table contains a BLOB column, Paimon automatically splits the data into two file families: structured columns (e.g., id, name) are stored in Parquet files, while large objects are stored in separate .blob files. Example layout:
table/
├── bucket-0/
│   ├── data-uuid-0.parquet   # columns: id, name (column pruning possible)
│   ├── data-uuid-1.blob      # picture bytes
│   ├── data-uuid-2.blob
│   └── ...
├── manifest/
├── schema/
└── snapshot/
Because the BLOB data is isolated, queries that only need id and name can skip reading the large BLOB files, and the BLOB files can be rolled based on a configurable size.
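The effect of keeping large objects in their own file family can be sketched with a small in-memory model. This is only an illustration: ToyBlobTable, its fields, and the picture column are hypothetical names invented for this sketch and are not part of any Paimon API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBlobTable:
    """In-memory stand-in for a table with one BLOB column (hypothetical)."""

    structured_rows: list = field(default_factory=list)  # stands in for .parquet files
    blob_files: list = field(default_factory=list)       # stands in for .blob files
    blob_bytes_read: int = 0                              # bytes pulled from blob files

    def write(self, row_id: int, name: str, picture: bytes) -> None:
        # Structured columns and the large object go to different file families.
        self.structured_rows.append({"id": row_id, "name": name})
        self.blob_files.append(picture)

    def scan(self, columns: list) -> list:
        out = []
        for pos, row in enumerate(self.structured_rows):
            record = {c: row[c] for c in columns if c in row}
            if "picture" in columns:
                # Only a projection that asks for the blob touches the blob files.
                blob = self.blob_files[pos]
                self.blob_bytes_read += len(blob)
                record["picture"] = blob
            out.append(record)
        return out


table = ToyBlobTable()
table.write(1, "cat", b"\x89PNG" + bytes(1000))
table.write(2, "dog", b"\x89PNG" + bytes(2000))

meta = table.scan(["id", "name"])                  # metadata-only query
bytes_after_metadata_scan = table.blob_bytes_read  # no blob bytes read so far
full = table.scan(["id", "name", "picture"])       # this one reads the blobs
```

In this toy model the metadata-only scan leaves the blob byte counter at zero, which is the behaviour the separate .blob files are meant to provide.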
Vector Storage
The new VECTOR<t, n> type is a first‑class citizen alongside Blob. Vector columns are stored independently as Lance files (.vector.lance) while Parquet stores pointers to those files. The vector field is declared in SQL with vector-field and field.<name>.vector-dim. Separate Lance files are used because Lance provides optimizations specifically for vector column access, as discussed in the referenced article "Daft + Ray + Lance: Building the Next‑Generation Multimodal Data Pipeline".
Lumina Vector Index + BTree Global Index
Lumina (DiskANN) serves as the new ANN index for billion‑scale vectors, replacing earlier Lucene/FAISS solutions. The BTree global index enables fast scalar lookups and row‑range pruning (e.g., row-range). A pre‑filter workflow first applies BTree filtering, then runs ANN search, dramatically reducing the vector search space. Key takeaway: the scalar index narrows the candidate set first, so the ANN search only has to consider vectors that can still match.
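The pre‑filter idea can be sketched in a few lines of plain Python. Note the substitutions: a sorted list searched with bisect stands in for the BTree index, and an exact brute‑force distance ranking stands in for the ANN search. Neither is Lumina or DiskANN, and the data, the price column, and the function names are made up for illustration.

```python
import bisect
import math

# Toy data: (row_id, price, embedding). All values are invented.
rows = [
    (0, 10, (0.0, 0.0)),
    (1, 25, (1.0, 1.0)),
    (2, 30, (0.9, 1.1)),
    (3, 55, (1.0, 0.9)),
    (4, 70, (5.0, 5.0)),
]

# Sorted (value, row_id) pairs stand in for the scalar index.
scalar_index = sorted((price, row_id) for row_id, price, _ in rows)
vectors = {row_id: emb for row_id, _, emb in rows}


def prefilter(lo, hi):
    """Return row ids whose price lies in [lo, hi], found by binary search."""
    start = bisect.bisect_left(scalar_index, (lo, -1))
    end = bisect.bisect_right(scalar_index, (hi, math.inf))
    return [row_id for _, row_id in scalar_index[start:end]]


def nearest(query, candidates, k):
    """Exact nearest neighbours over the candidate ids only (stand-in for ANN)."""
    return sorted(candidates, key=lambda rid: math.dist(query, vectors[rid]))[:k]


candidates = prefilter(20, 60)                  # scalar filter runs first
top = nearest((1.0, 1.0), candidates, k=2)      # vector search sees 3 rows, not 5
```

The vector step never looks at rows 0 and 4 because the scalar filter has already excluded them; at real scale that reduction is where the saving comes from.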
Data Evolution
Blob and Vector capabilities are gated by the Data Evolution control plane. Enabling row-tracking.enabled=true assigns an immutable _ROW_ID to each row, while data-evolution.enabled=true allows partial column updates and MERGE INTO operations based on that identifier. This makes it possible to add new feature columns (e.g., embedding columns) without rewriting existing data.
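A rough way to picture this is a base file that is never touched again plus a later file that carries only the new column, with the two joined on _ROW_ID at read time. The sketch below models that with dictionaries. The two option keys come from the text above; everything else (file contents, read_merged) is a hypothetical illustration and not how Paimon lays out or reads its files.

```python
# Table options named in the article; how they are passed to an engine is not shown here.
table_options = {
    "row-tracking.enabled": "true",
    "data-evolution.enabled": "true",
}

# Base file written once: _ROW_ID -> original columns.
base_file = {
    0: {"id": 101, "name": "cat.png"},
    1: {"id": 102, "name": "dog.png"},
    2: {"id": 103, "name": "owl.png"},
}
base_snapshot = {rid: dict(cols) for rid, cols in base_file.items()}

# Later, a separate file adds an embedding column for some rows, keyed by _ROW_ID.
embedding_file = {
    0: {"embedding": [0.1, 0.2]},
    2: {"embedding": [0.7, 0.8]},
}


def read_merged(column_files):
    """Stitch column files together by _ROW_ID at read time."""
    merged = {}
    for column_file in column_files:
        for row_id, cols in column_file.items():
            merged.setdefault(row_id, {"_ROW_ID": row_id}).update(cols)
    return [merged[rid] for rid in sorted(merged)]


rows = read_merged([base_file, embedding_file])
```

Because the join key is stable, the embedding column can be added for a subset of rows while the base file stays byte‑for‑byte the same.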
Variant Shredding
Introduced in version 1.1 and completed in 1.4, Variant shredding automatically infers schema for semi‑structured Variant columns, supporting configurable maximum width, depth, field cardinality ratio, and buffer rows. The result is that event‑log or telemetry JSON data is stored as Parquet columns, allowing column pruning and avoiding full JSON parsing at query time.
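The general idea of shredding can be sketched as follows: look at a buffer of records, promote the fields that appear often enough to their own columns, and keep everything else as a residual. This is a simplified illustration over top‑level fields only. The parameters min_ratio and max_width are hypothetical knobs that loosely mirror the ratio and width settings mentioned above; they are not Paimon option names, and the actual inference rules may differ.

```python
import json


def shred(json_lines, min_ratio=0.5, max_width=3):
    """Split JSON records into per-field columns plus a residual per row.

    min_ratio and max_width are hypothetical parameters for this sketch.
    """
    records = [json.loads(line) for line in json_lines]
    counts = {}
    for rec in records:
        for key in rec:
            counts[key] = counts.get(key, 0) + 1
    # Keep the most frequent top-level fields that appear often enough.
    frequent = [k for k, c in counts.items() if c / len(records) >= min_ratio]
    shredded_fields = sorted(frequent, key=lambda k: (-counts[k], k))[:max_width]
    columns = {k: [rec.get(k) for rec in records] for k in shredded_fields}
    residual = [
        {k: v for k, v in rec.items() if k not in shredded_fields}
        for rec in records
    ]
    return columns, residual


events = [
    '{"user": "a", "action": "click", "debug": {"x": 1}}',
    '{"user": "b", "action": "view"}',
    '{"user": "c", "action": "click", "latency_ms": 12}',
    '{"user": "d"}',
]
columns, residual = shred(events)
```

Here user and action become columns, while the rare debug and latency_ms fields stay in the residual, so a query on action alone would not need to parse any JSON.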
PyPaimon Native Interface
Version 1.4 adds a pure‑Python SDK with no JDK dependency, providing a native interface that bridges the data lake to AI workloads without additional runtime overhead.
Before 1.4, Paimon’s value proposition centered on "real‑time + batch" processing. Starting with 1.4, the project adds the more aggressive claim of storing training samples, inference features, event logs, and vector embeddings all within a single table, a single metadata set, and a single transactional layer, effectively removing the last barrier between data lake storage and AI applications.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
