Ceph Architecture Overview: Features, Core Components, IO Workflow, Heartbeat, Communication Framework, CRUSH Algorithm, and QoS
This article provides a comprehensive technical overview of Ceph, covering its architecture, key components, storage interfaces, IO processes, heartbeat mechanisms, communication framework, CRUSH data placement algorithm, and customizable QoS strategies for distributed storage systems.
Ceph Overview
Ceph is a unified distributed storage system designed for high performance, reliability, and scalability. Originating from research in 2004, it is now widely supported by cloud providers and integrates with platforms such as RedHat and OpenStack for backend VM image storage.
Key Characteristics
High performance using the CRUSH algorithm for balanced data distribution and parallelism.
High availability with flexible replica counts, fault‑domain isolation, and automatic self‑healing.
Scalability through decentralized design, allowing linear growth as nodes are added.
Rich feature set supporting block, file, and object storage interfaces, as well as custom drivers.
Architecture and Core Components
Ceph provides three storage interfaces:
Object: native RADOS API, compatible with Swift and S3.
Block: supports thin provisioning, snapshots, and cloning.
File: POSIX interface with snapshot capability.
Core components include:
Monitor (MON): a small cluster of monitor daemons that use Paxos to maintain the authoritative cluster maps.
Object Storage Device (OSD): processes that store data objects and handle client requests.
Metadata Server (MDS): provides metadata services for CephFS.
RADOS: the underlying reliable autonomic distributed object store.
CRUSH: the data placement algorithm.
RBD, RGW, CephFS: the block, object, and file services exposed to users.
Storage Types
Block Storage
Typical devices are disk arrays; advantages include RAID/LVM protection, increased capacity, and improved I/O throughput. Drawbacks are higher cost for SAN fabrics and lack of data sharing between hosts. Common use cases are VM disks, logs, and general file storage.
File Storage
Implemented via FTP or NFS servers, offering low cost and easy file sharing. Limitations are lower read/write and transfer speeds. Used for log storage and hierarchical file systems.
Object Storage
Built on large‑capacity distributed servers compatible with Swift/S3. It combines the direct access speed of block storage with the sharing capability of file storage. Ideal for infrequently changed data such as images and videos.
Ceph IO Workflow
Typical client IO steps:
Client creates a cluster handler and reads the configuration.
Client connects to monitors to obtain the cluster map.
Client uses the CRUSH map to locate the primary OSD for the target PG.
Primary OSD writes data and replicates to secondary OSDs.
After all replicas acknowledge, the client receives completion.
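The replication step above can be sketched as follows. This is a minimal illustration of Ceph's primary‑copy write path, not Ceph's actual classes: the client sends the write to the primary OSD only, the primary forwards it to the replicas, and the client sees completion only after every replica has acknowledged.

```python
# Illustrative sketch of the primary-copy write path (names are made up,
# not Ceph internals): the primary OSD writes locally, replicates to the
# secondaries, and the write completes only when all replicas have acked.

class Osd:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}            # object name -> bytes

    def write(self, obj, data):
        self.store[obj] = data
        return True                # ack

def client_write(obj, data, osds_for_pg):
    primary, replicas = osds_for_pg[0], osds_for_pg[1:]
    acks = [primary.write(obj, data)]                # primary writes first
    acks += [r.write(obj, data) for r in replicas]   # then replicates
    return all(acks)               # completion only after all replicas ack

pg_osds = [Osd(3), Osd(7), Osd(12)]                  # acting set from CRUSH
assert client_write("rbd0.object1", b"payload", pg_osds)
assert all(o.store["rbd0.object1"] == b"payload" for o in pg_osds)
```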
If a newly added OSD is mapped as a PG's primary before it holds that PG's data, it reports to the monitors and a temporary primary is appointed in its place; the temporary primary serves IO while the new OSD synchronizes, then hands the role back once synchronization finishes.
IO Algorithm Details
File → Object mapping uses inode (ino) and object number (ono) to form an object ID (oid). The oid is hashed and masked to produce a PG ID, which the CRUSH algorithm maps to a set of OSDs.
locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg) # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
RBD IO Process
Clients create a pool with a defined number of PGs, mount an RBD image, and write data in 4 MiB chunks. Each chunk becomes an object that is placed on three OSDs via the CRUSH algorithm. The OSDs format the underlying disks (typically XFS) and store objects as files such as rbd0.object1.file.
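The 4 MiB striping above means a byte offset in an RBD image maps directly to an object index and an offset inside that object. A minimal sketch (the object‑name format here is illustrative, not Ceph's exact naming scheme):

```python
# Sketch of RBD striping: the image is split into fixed-size objects
# (4 MiB by default); a write at a byte offset maps to an object index
# plus an offset within that object.

OBJECT_SIZE = 4 * 1024 * 1024        # 4 MiB default RBD object size

def rbd_locate(image_prefix, byte_offset):
    obj_index = byte_offset // OBJECT_SIZE
    obj_offset = byte_offset % OBJECT_SIZE
    return f"{image_prefix}.object{obj_index}", obj_offset

name, off = rbd_locate("rbd0", 9 * 1024 * 1024)   # 9 MiB into the image
# 9 MiB falls in the third object (index 2), 1 MiB past its start
assert name == "rbd0.object2" and off == 1024 * 1024
```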
Heartbeat Mechanism
Heartbeats detect node failures. OSDs listen on public, cluster, front, and back ports. They exchange PING/PONG messages within the same PG every ~6 seconds; missing replies for 20 seconds trigger failure handling. Monitors also receive periodic reports from OSDs and aggregate failure information before marking an OSD down.
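The peer‑failure detection described above can be sketched as a simple timeout check: peers are pinged every ~6 seconds, and a peer whose last PONG is older than the 20‑second grace period is reported as failed. Timings and structure here are simplified, not Ceph's actual heartbeat code.

```python
# Simplified sketch of OSD peer-failure detection: track the last PONG
# time per peer and report any peer silent for longer than the grace
# period to the monitors.

PING_INTERVAL = 6.0    # seconds between PINGs to each peer
GRACE = 20.0           # no PONG for this long -> report failure

def check_peers(last_pong, now):
    """Return the peers whose last PONG exceeds the grace period."""
    return [peer for peer, t in last_pong.items() if now - t > GRACE]

last_pong = {"osd.1": 100.0, "osd.2": 85.0, "osd.3": 99.0}
assert check_peers(last_pong, now=106.0) == ["osd.2"]   # silent for 21 s
```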
Communication Framework
Ceph supports three networking models: simple thread‑per‑connection, async I/O multiplexing (default), and XIO using the Accelio library. The framework follows a publish/subscribe pattern where a Messenger publishes messages to Dispatcher subclasses via Pipe objects. Messages consist of a header, user data (payload, middle, data), and a footer.
class Message : public RefCountedObject {
protected:
ceph_msg_header header; // message header
ceph_msg_footer footer; // message footer
bufferlist payload; // front unaligned blob
bufferlist middle; // middle unaligned blob
bufferlist data; // data payload (page‑aligned when possible)
utime_t recv_stamp; // receive start timestamp
utime_t dispatch_stamp; // dispatch timestamp
utime_t throttle_stamp; // throttle timestamp
utime_t recv_complete_stamp; // receive complete timestamp
ConnectionRef connection; // network connection
uint32_t magic = 0; // magic number
bi::list_member_hook<> dispatch_q; // intrusive list hook
};
struct ceph_msg_header {
__le64 seq; // per‑session sequence number
__le64 tid; // transaction ID
__le16 type; // message type
__le16 priority;
__le16 version;
__le32 front_len;
__le32 middle_len;
__le32 data_len;
__le16 data_off;
ceph_entity_name src; // source entity
__le16 compat_version;
__le16 reserved;
__le32 crc; // header CRC32C
} __attribute__((packed));
struct ceph_msg_footer {
__le32 front_crc, middle_crc, data_crc;
__le64 sig; // 64‑bit signature
__u8 flags; // end flag
} __attribute__((packed));
CRUSH Algorithm
CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data) maps placement groups (PGs) to OSDs using a hierarchical cluster map and placement rules. It supports various bucket types (uniform, list, tree, straw) to achieve balanced data distribution, fault‑domain awareness, and efficient scaling.
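The straw bucket idea can be sketched as a weighted pseudo‑random draw: every candidate item draws a "straw" scaled by its weight, and the longest straw wins, so heavier items are selected more often while the mapping stays deterministic. This is a simplified illustration; real CRUSH uses the Jenkins hash and a more careful scaling of straw lengths.

```python
# Simplified sketch of straw-bucket selection: each item draws a
# deterministic pseudo-random value scaled by its weight; the largest
# draw wins. hashlib stands in for CRUSH's Jenkins hash.

import hashlib

def _draw(pg_id, item, trial):
    h = hashlib.sha256(f"{pg_id}:{item}:{trial}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64    # uniform in [0, 1)

def straw_select(pg_id, weights, num_replicas):
    """Pick num_replicas distinct items; heavier items win more often."""
    chosen = []
    for trial in range(num_replicas):
        best = max(
            (i for i in weights if i not in chosen),
            key=lambda i: weights[i] * _draw(pg_id, i, trial),
        )
        chosen.append(best)
    return chosen

osds = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0}   # osd.2 is larger
acting = straw_select(pg_id=42, weights=osds, num_replicas=2)
assert len(set(acting)) == 2                         # two distinct OSDs
assert straw_select(42, osds, 2) == acting           # deterministic mapping
```

Determinism is the key property: any client with the same cluster map computes the same acting set without consulting a central lookup table.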
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
Customizable QoS
Ceph’s QoS aims to allocate limited I/O resources (bandwidth, IOPS) among users. The official mClock scheduler provides reservation, weight, and limit parameters. A simpler token‑bucket implementation can also enforce rate limits by classifying requests, consuming tokens proportional to request size, and throttling when tokens are insufficient.
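The token‑bucket approach mentioned above can be sketched as follows: each client class owns a bucket refilled at a fixed rate up to a burst capacity, a request consumes tokens proportional to its size, and it is throttled when the bucket runs short. Parameter names here are illustrative, not Ceph configuration options.

```python
# Sketch of a token-bucket rate limiter for per-client QoS: tokens refill
# continuously at `rate` per second up to `burst`; a request costing more
# tokens than are available is throttled.

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate           # tokens added per second
        self.burst = burst         # bucket capacity
        self.tokens = burst        # start full
        self.last = 0.0            # time of last refill

    def allow(self, cost, now):
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True            # request proceeds
        return False               # request throttled

bucket = TokenBucket(rate=100.0, burst=200.0)   # e.g. 100 4 KiB-units/s
assert bucket.allow(cost=150, now=0.0)          # fits in the initial burst
assert not bucket.allow(cost=100, now=0.0)      # only 50 tokens remain
assert bucket.allow(cost=100, now=1.0)          # 100 tokens refilled in 1 s
```

Charging cost in proportion to request size keeps large and small IOs on the same budget, which is what allows the bucket to bound bandwidth rather than just IOPS.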
Author Information
Written by Li Hang, a senior developer with extensive experience in high‑performance Nginx, Redis Cluster, and Ceph, currently working at Didi’s infrastructure platform.
Architecture Digest
Focused on Java backend development, covering large-scale internet application architecture (high availability, high performance, high stability), big data, machine learning, and other popular fields.