Red Hat Ceph Storage Architecture Overview and Key Components
This article provides a comprehensive English translation of the Red Hat Ceph Storage Architecture Guide, covering Ceph's distributed object storage concepts, cluster architecture, storage pools, CRUSH algorithm, replication and erasure‑coding I/O, internal operations, high‑availability mechanisms, client interfaces, and encryption considerations for cloud environments.
Chapter 1 Overview
Red Hat Ceph is a distributed object storage system designed for high performance, reliability, and scalability, supporting modern and legacy object interfaces such as native language bindings (C/C++, Java, Python), RESTful S3/Swift APIs, block device, and file system interfaces.
Ceph can scale to thousands of clients and petabyte‑to‑exabyte data volumes, making it suitable for cloud platforms such as Red Hat Enterprise Linux OpenStack Platform (RHEL OSP).
The core of any Ceph deployment is the Ceph storage cluster, which consists of two main daemon types:
Ceph OSD daemon: stores data and performs replication, rebalancing, recovery, health monitoring, and status reporting.
Ceph Monitor daemon: maintains the master copy of the cluster map.
Clients interact with the cluster using a configuration file (or cluster name and monitor addresses), a pool name, and user credentials.
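As a sketch of what a client needs, a minimal client-side ceph.conf names the monitors to contact and the keyring holding the user's credentials (the addresses and path below are placeholders, not values from this guide):

```ini
[global]
# Monitor addresses the client contacts to fetch the cluster map (placeholders)
mon_host = 192.168.1.10,192.168.1.11,192.168.1.12

[client.admin]
# CephX keyring holding this user's secret key
keyring = /etc/ceph/ceph.client.admin.keyring
```

With this file, pool name, and user name in hand, a client can reach any monitor and bootstrap its view of the cluster.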
Chapter 2 Storage Cluster Architecture
The cluster provides data storage, replication, health monitoring, dynamic rebalancing, integrity checking, and failure recovery while remaining transparent to client interfaces.
2.1 Storage Pools
Pools logically partition data. Each pool defines its type (replicated or erasure‑coded), its number of placement groups (PGs), its CRUSH rule set, and its durability settings.
2.2 Authentication (CephX)
CephX uses shared secret keys for mutual authentication between clients and monitors, providing protection against man‑in‑the‑middle attacks.
2.3 Placement Groups (PGs)
Objects are hashed into PGs, which are then mapped to an acting set of OSDs via the CRUSH algorithm, enabling dynamic data placement and rebalancing.
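The object‑to‑PG step is a stable hash modulo the pool's PG count. A minimal sketch (Ceph itself uses its own rjenkins-based hash; md5 here is just a deterministic stand‑in):

```python
import hashlib

def object_to_pg(object_id: str, pg_num: int) -> int:
    """Map an object name to a placement group index.

    Conceptual sketch: hash the name, then take it modulo the pool's
    PG count. The hash function is a stand-in for Ceph's own hashing.
    """
    h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
    return h % pg_num
```

Because the mapping depends only on the object name and the PG count, every client computes the same PG for the same object without consulting a central lookup table.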
2.4 CRUSH Algorithm
CRUSH maps objects to PGs and PGs to OSDs based on a hierarchical bucket topology, supporting fault‑domain and performance‑domain isolation.
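The key property of CRUSH is that it is a deterministic pseudo‑random function: any client can compute a PG's OSDs from the cluster map alone. The sketch below uses rendezvous (highest‑random‑weight) hashing as a simplified stand‑in; real CRUSH instead walks a hierarchical bucket topology (straw2 buckets) so that replicas land in separate failure domains:

```python
import hashlib

def crush_like_map(pg_id: int, osds: list, size: int) -> list:
    """Deterministically pick `size` distinct OSDs for a PG.

    Simplified stand-in for CRUSH using rendezvous hashing: each
    (pg, osd) pair gets a pseudo-random weight, and the top `size`
    OSDs win. Real CRUSH additionally honors the bucket hierarchy
    and placement rules.
    """
    def weight(osd):
        return hashlib.sha256(f"{pg_id}:{osd}".encode()).hexdigest()
    return sorted(osds, key=weight)[:size]
```

The same inputs always yield the same acting set, which is what lets Ceph avoid a central metadata server for data placement.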
2.5 I/O Operations
Clients obtain the latest cluster map from a monitor, then use the object ID, pool name, and CRUSH to compute the target PG and primary OSD. The primary OSD coordinates writes to replica OSDs.
2.5.1 Replicated I/O
The primary OSD writes the object to replica OSDs; once acknowledgments are received, the client is notified of success.
2.5.2 Erasure‑Coding I/O
Data is split into K data blocks and M coding blocks; the primary OSD distributes these blocks across OSDs, enabling reconstruction when up to M OSDs fail.
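The smallest instructive case is K=2, M=1, where the single coding chunk is just the XOR of the two data chunks. Real Ceph erasure‑code profiles typically use jerasure Reed‑Solomon codes, which support M > 1; this XOR sketch only shows the recover‑from‑parity idea:

```python
def ec_encode(data: bytes):
    """Split data into K=2 equal chunks plus one XOR parity chunk (M=1).
    Toy illustration; Ceph's EC plugins generalize this to arbitrary K+M."""
    assert len(data) % 2 == 0
    half = len(data) // 2
    chunks = [data[:half], data[half:]]
    parity = bytes(a ^ b for a, b in zip(chunks[0], chunks[1]))
    return chunks, parity

def ec_reconstruct(surviving_chunk: bytes, parity: bytes) -> bytes:
    """Recover the one lost data chunk: lost = survivor XOR parity."""
    return bytes(a ^ b for a, b in zip(surviving_chunk, parity))
```

With K=2, M=1 the object survives the loss of any one of the three chunks, at a storage overhead of 1.5x instead of the 3x of triple replication.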
2.6 Internal Self‑Management Operations
Heartbeat – OSDs check each other's liveness and report up/down status to the monitors.
Peering – OSDs that share a PG automatically agree on the state of the objects in it.
Rebalancing – New OSDs cause a small fraction of data to migrate based on CRUSH.
Scrubbing – Periodic verification and cleaning of object metadata and data.
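The rebalancing behavior above can be simulated: because placement is a deterministic function of the OSD set, adding one OSD to a cluster of nine moves only roughly one tenth of the objects, and every moved object lands on the newcomer. The sketch below reuses rendezvous hashing as a stand‑in for CRUSH:

```python
import hashlib

def place(oid: str, osds: list) -> str:
    """Rendezvous-hash an object to a single OSD (stand-in for CRUSH)."""
    return max(osds, key=lambda o: hashlib.sha256(f"{oid}:{o}".encode()).hexdigest())

osds = [f"osd.{i}" for i in range(9)]
objects = [f"obj-{i}" for i in range(2000)]

before = {o: place(o, osds) for o in objects}
after = {o: place(o, osds + ["osd.9"]) for o in objects}

# Roughly 1/10 of the objects migrate, and only onto the new OSD;
# nothing shuffles between the pre-existing OSDs.
moved = sum(before[o] != after[o] for o in objects)
```

This minimal‑movement property is what keeps expansion traffic proportional to the added capacity rather than to the cluster size.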
2.7 High Availability
Data Replication – Default three‑copy replication; writes require at least two clean copies.
Mon Cluster – Multiple monitors provide quorum and avoid single‑point failure.
CephX – Provides secure, key‑based authentication without a single monitor bottleneck.
Chapter 3 Client Architecture
Ceph offers block devices (RBD), object gateway (RGW), and CephFS, all built on the RADOS protocol.
3.1 Native Protocol and librados
Provides direct, parallel object access with operations such as pool management, snapshots, read/write, XATTR and key/value handling, and compound operations.
3.2 Object Watch/Notify
Clients can register persistent watches on objects and receive notifications from the primary OSD.
3.3 Exclusive Locks
Allows a single client to obtain an exclusive lock on an RBD image, preventing concurrent writes.
3.4 Object Map Index
Tracks existence of RADOS objects in client memory to avoid unnecessary OSD queries for non‑existent objects, improving operations such as resize, export, copy, flatten, delete, and read.
3.5 Data Striping
Striping splits data across multiple objects to improve throughput; parameters include object size, stripe width, and stripe count. Ceph’s CRUSH algorithm then maps striped objects to PGs and OSDs.
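Given a stripe unit, a stripe count, and an object size, the target object and intra‑object offset for any logical byte offset are pure arithmetic. A sketch of that mapping (assuming, as Ceph requires, that the object size is a multiple of the stripe unit):

```python
def stripe_map(offset: int, stripe_unit: int, stripe_count: int, object_size: int):
    """Map a logical byte offset to (object index, offset within object)
    under RADOS-style striping."""
    stripes_per_object = object_size // stripe_unit
    block = offset // stripe_unit            # which stripe unit overall
    stripe = block // stripe_count           # which stripe (row)
    shard = block % stripe_count             # which object column in the set
    object_set = stripe // stripes_per_object
    obj = object_set * stripe_count + shard
    off_in_obj = (stripe % stripes_per_object) * stripe_unit + offset % stripe_unit
    return obj, off_in_obj
```

For example, with a 4‑byte stripe unit, stripe count 2, and 8‑byte objects, consecutive stripe units alternate between two objects until both are full, then a new object set begins; writes to consecutive offsets thus hit different OSDs in parallel.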
Chapter 4 Encryption
LUKS can encrypt the OSD data and journal partitions. ceph-ansible invokes ceph-disk to create the encrypted partitions plus a lockbox partition, and the LUKS keys are stored in the monitors' key/value store. At service start‑up, OSDs retrieve their keys and decrypt their partitions automatically.
For detailed steps, refer to the Red Hat Ceph Storage Installation Guide.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.