
LightPool: A Cloud‑Native NVMe‑oF Based High‑Performance Storage Pool Architecture for Distributed Databases

This article introduces LightPool, an open-source, cloud-native storage-pool architecture presented at HPCA 2024. It combines NVMe over Fabrics (NVMe-oF), Kubernetes CSI integration, and a lightweight user-space storage engine to give large-scale distributed databases high-performance, elastic, and highly available storage while reducing cost and improving resource utilization.

AntTech

The 30th IEEE International Symposium on High-Performance Computer Architecture (HPCA), held in Edinburgh from March 2 to 6, 2024, accepted a paper titled "LightPool: A NVMe-oF-based High-performance and Lightweight Storage Pool Architecture for Cloud-Native Distributed Database," which highlights the authors' open-source product, LiteIO.

The paper explains that modern cloud‑native databases face performance, cost, and stability pressures, prompting the design of a novel storage‑pool architecture that matches local storage performance while providing elasticity and reducing storage costs.

Three traditional database storage models are compared: compute‑storage coupling, compute‑storage separation (ECS + EBS/S3), and shared storage architectures, each with trade‑offs in performance, cost, and scalability.

LightPool addresses these issues by pooling idle storage resources across heterogeneous ECS nodes, decoupling disk scheduling from container scheduling, and exposing storage via a Kubernetes‑native CSI plugin.
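As a rough illustration of what pooling idle capacity and decoupling disk placement from container placement can mean, the Python sketch below treats the free SSD space on every node as one pool and lets a volume land on a different node than the pod that uses it. The function names, the node names, and the local-first preference are assumptions made for this example; they are not taken from LightPool or LiteIO.

```python
from typing import Dict, Optional, Tuple


def pooled_idle_gib(free_by_node: Dict[str, int]) -> int:
    """Total idle capacity across all nodes, viewed as a single pool."""
    return sum(free_by_node.values())


def place_volume(
    free_by_node: Dict[str, int], pod_node: str, size_gib: int
) -> Optional[Tuple[str, str]]:
    """Pick a storage node for a volume, independently of where the pod runs.

    Returns (storage_node, access_mode), where access_mode is "local" when the
    disk sits on the pod's own node and "remote" otherwise, or None when no
    single node can hold the volume.
    Assumption for this sketch: prefer the pod's node, else the node with the
    most free space.
    """
    if free_by_node.get(pod_node, 0) >= size_gib:
        target = pod_node
    else:
        fits = {n: free for n, free in free_by_node.items() if free >= size_gib}
        if not fits:
            return None
        target = max(fits, key=lambda n: fits[n])
    free_by_node[target] -= size_gib  # reserve the capacity in the pool
    mode = "local" if target == pod_node else "remote"
    return (target, mode)


free = {"node-a": 100, "node-b": 500, "node-c": 300}
total_before = pooled_idle_gib(free)
first = place_volume(free, "node-a", 50)    # fits on the pod's own node
second = place_volume(free, "node-a", 200)  # too big for node-a, goes elsewhere
third = place_volume(free, "node-a", 1000)  # no node can hold it
```

In this sketch a "remote" result stands for a volume that the pod would reach over the network (NVMe-oF in LightPool's case), which is what allows idle disks on one node to serve containers scheduled on another.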

A LightPool cluster has two kinds of nodes. Control nodes manage the SSD pools and interact with the Kubernetes master; worker nodes run containers and the LightPool storage engine. Storage traffic uses NVMe over Fabrics (NVMe-oF) with TCP or RDMA transports for a high-speed data path.

Scheduling follows a Kubernetes-like framework: basic, affinity, and custom filters narrow the set of candidate nodes, and priority rules then favor nodes with lower resource usage. Two approaches keep local-disk scheduling correct: relying on extended node resources, or integrating with the Scheduler Framework.
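The filter-then-prioritize flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not LightPool's scheduler: the `StorageNode` and `VolumeRequest` types, the label-matching affinity rule, and the utilization-based score are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class StorageNode:
    name: str
    total_gib: int
    used_gib: int
    labels: Dict[str, str] = field(default_factory=dict)

    @property
    def free_gib(self) -> int:
        return self.total_gib - self.used_gib


@dataclass
class VolumeRequest:
    size_gib: int
    required_labels: Dict[str, str] = field(default_factory=dict)


Filter = Callable[[StorageNode, VolumeRequest], bool]


def basic_filter(node: StorageNode, req: VolumeRequest) -> bool:
    """Basic filter: the node must have enough free capacity."""
    return node.free_gib >= req.size_gib


def affinity_filter(node: StorageNode, req: VolumeRequest) -> bool:
    """Affinity filter (assumed form): every required label must match."""
    return all(node.labels.get(k) == v for k, v in req.required_labels.items())


def least_used_score(node: StorageNode) -> float:
    """Priority rule: a lower utilization gives a higher score."""
    if node.total_gib <= 0:
        return 0.0
    return 1.0 - node.used_gib / node.total_gib


def schedule(
    nodes: List[StorageNode],
    req: VolumeRequest,
    filters: Sequence[Filter] = (basic_filter, affinity_filter),
) -> Optional[StorageNode]:
    """Drop nodes rejected by any filter, then return the best-scoring one."""
    candidates = [n for n in nodes if all(f(n, req) for f in filters)]
    if not candidates:
        return None
    return max(candidates, key=least_used_score)


nodes = [
    StorageNode("node-a", total_gib=1000, used_gib=900, labels={"zone": "z1"}),
    StorageNode("node-b", total_gib=1000, used_gib=200, labels={"zone": "z1"}),
    StorageNode("node-c", total_gib=1000, used_gib=100, labels={"zone": "z2"}),
]
in_zone = schedule(nodes, VolumeRequest(150, {"zone": "z1"}))
anywhere = schedule(nodes, VolumeRequest(150))
too_big = schedule(nodes, VolumeRequest(2000))
```

A custom filter would be one more function with the same signature appended to `filters`, so new placement rules can be added without touching the scoring step.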

The storage engine runs in user space and is lightweight. It supports zero-copy local protocols, multiple media types, snapshots, and RAID, enabling cost-effective use of SSDs, QLC caches, and ZNS devices.

High availability is achieved through hot-upgrade (sub-second downtime) and hot-migration techniques, which allow seamless updates and data movement between worker nodes without service interruption.

LightPool is released as an open‑source project (https://github.com/eosphoros-ai/liteio) and invites contributions from the community.
