
Analysis of Ceph Bluestore Storage Engine Architecture

This article examines Ceph's Bluestore storage engine: its raw‑device management, metadata structures, write and read I/O processing, allocation strategies, and clone handling. It highlights how Bluestore reduces write amplification and optimizes for SSDs compared with the older Filestore.

Architect

Ceph supports multiple storage back‑ends via a plug‑in model; while Filestore was the default, its journal‑based write path caused significant write amplification and was not SSD‑optimized, prompting the development of Bluestore to manage raw devices directly and reduce overhead.

Bluestore bypasses traditional file systems (ext4/xfs) by using a user‑space BlockDevice with Linux AIO for I/O, employs allocators such as StupidAllocator and BitmapAllocator, and stores metadata in RocksDB through a custom BlueRocksEnv that wraps a lightweight BlueFS file system.

Metadata is represented by Onodes, in‑memory structures persisted as key‑value pairs in RocksDB; each Onode contains lextents that map logical data blocks to blobs (bluestore_blob_t), and each blob contains pextents that locate the actual physical regions on the device. Bnodes are used to track extents shared among multiple objects.
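The Onode/lextent/blob/pextent hierarchy described above can be sketched roughly as follows. This is a simplified illustration with made-up names; the actual Ceph structures (bluestore_onode_t, bluestore_blob_t, bluestore_pextent_t) carry far more state:

```python
# Hypothetical sketch of Bluestore's metadata model (simplified names;
# the real structures are bluestore_onode_t, bluestore_blob_t, etc.).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PExtent:
    """A contiguous physical region on the raw device."""
    dev_offset: int   # byte offset on the block device
    length: int       # length in bytes

@dataclass
class Blob:
    """Owns one or more physical extents; may become shared after a clone."""
    pextents: List[PExtent] = field(default_factory=list)
    shared: bool = False

@dataclass
class LExtent:
    """Maps a logical range of the object onto part of a blob."""
    blob: Blob
    blob_offset: int  # where this logical range begins inside the blob
    length: int

@dataclass
class Onode:
    """Per-object metadata, persisted as key/value pairs in RocksDB."""
    extent_map: Dict[int, LExtent] = field(default_factory=dict)  # logical offset -> lextent
```

Resolving a read or write then amounts to walking extent_map from the logical offset to a blob, and from the blob's pextents to device addresses.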

Write I/O arrives with an offset and length relative to an object. The engine first splits the request at minimum‑allocation‑size (min_alloc_size) boundaries. Aligned portions are handled by do_write_big, which creates new lextents and blobs as needed, while unaligned portions go to do_write_small, which attempts to reuse existing blob space and performs zero‑padding to satisfy block‑size alignment.
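The splitting step can be sketched as below. This is a minimal illustration of the boundary arithmetic only, with an invented function name (split_write), not Ceph's actual code:

```python
# Hypothetical sketch of splitting a write at min_alloc_size boundaries:
# the unaligned head and tail would take the do_write_small path, while
# the aligned middle takes the do_write_big path. split_write is an
# illustrative name, not Ceph's.
def split_write(offset, length, min_alloc_size):
    """Return (offset, length, is_big) segments for a write request."""
    end = offset + length
    head_end = -(-offset // min_alloc_size) * min_alloc_size  # round up
    tail_begin = end // min_alloc_size * min_alloc_size       # round down
    if head_end >= tail_begin:
        return [(offset, length, False)]   # no aligned middle at all
    segs = []
    if offset < head_end:                  # unaligned head -> small write
        segs.append((offset, head_end - offset, False))
    segs.append((head_end, tail_begin - head_end, True))  # aligned middle
    if tail_begin < end:                   # unaligned tail -> small write
        segs.append((tail_begin, end - tail_begin, False))
    return segs
```

For example, with a 4 KiB min_alloc_size, a write at offset 1000 of length 10000 splits into an unaligned head, one aligned 4096‑byte middle segment, and an unaligned tail.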

For small writes, the engine first looks for a suitable free region inside an existing blob and places the data there; when no reusable blob is found, a new blob is allocated. Overwrites additionally require alignment handling: unaligned portions may need a read‑modify‑write, new lextents are created and existing ones adjusted, and the changes are recorded in RocksDB via the write‑ahead log.

The overall write flow follows a series of steps illustrated in the original diagram, encompassing allocation, alignment, blob selection, lextent creation, and metadata updates.

Read I/O similarly locates the corresponding lextent; if the requested region falls into a hole (unwritten area), the engine returns zero‑filled data.
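The hole behavior can be sketched as follows. This is an illustrative stand‑in (the extent map is reduced to offset/length pairs and a marker byte replaces actual device reads), not Ceph's API:

```python
# Hypothetical sketch of hole handling on the read path: regions covered
# by a lextent are "read" (a marker byte stands in for device data), while
# unmapped regions come back zero-filled. Illustrative, not Ceph's code.
def read_range(extent_map, offset, length):
    """extent_map: {logical_offset: extent_length}; returns a bytearray."""
    buf = bytearray(length)                 # holes stay zero-filled
    for eoff, elen in extent_map.items():
        lo = max(eoff, offset)
        hi = min(eoff + elen, offset + length)
        for i in range(lo, hi):             # overlap with the request
            buf[i - offset] = 0xAB          # stand-in for a device read
    return buf
```

A read straddling the end of the last extent thus returns real data up to the extent boundary and zeros beyond it, with no device I/O issued for the hole.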

For clone operations, Bnodes record shared lextents. During a snapshot, the original object's blob references are moved to a Bnode, marked as shared, and the new snapshot's Onode points to the same Bnode. Subsequent writes to the original object that hit shared blobs trigger copy‑on‑write: a new blob is created, the reference in the original lextent is removed, and reference counting determines when the shared space can be reclaimed.
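The reference-counting side of this copy‑on‑write scheme can be sketched as below. This is a simplified stand‑in for Ceph's shared‑blob machinery with invented names (SharedBlob, clone, overwrite), not its actual API:

```python
# Hypothetical sketch of copy-on-write after a clone: a blob shared via a
# Bnode carries a reference count; writing through a shared reference
# allocates a fresh private blob and drops one reference. Simplified
# stand-in for Ceph's shared-blob handling, not its real interfaces.
class SharedBlob:
    def __init__(self):
        self.refcount = 1   # lextents currently pointing at this blob

class LExtentRef:
    def __init__(self, blob):
        self.blob = blob

def clone(src):
    """Snapshot: the new lextent shares the source's blob."""
    src.blob.refcount += 1
    return LExtentRef(src.blob)

def overwrite(le):
    """COW: a writer sharing a blob gets a private copy; the shared
    space is reclaimable once the old blob's refcount reaches zero."""
    if le.blob.refcount > 1:
        le.blob.refcount -= 1
        le.blob = SharedBlob()  # new private blob, refcount 1
    # else: sole owner, safe to overwrite in place
```

After an overwrite, the snapshot still holds the old blob with a decremented count, while the original object writes into fresh space, which matches the behavior described above.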

In summary, the article outlines Bluestore's architecture, its metadata model, and the detailed I/O mapping logic, noting that this is only an introductory overview and that future analyses will cover allocation strategies, caching, compression, and other components.

Tags: storage engine, metadata, distributed storage, Bluestore, Ceph, I/O architecture
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
