
An Introduction to RDMA: Principles, Programming, and Applications

This article explains RDMA technology, covering its core principles, programming model with Verbs API, various communication modes, and its impact on data‑center networking, high‑performance computing, and distributed storage, highlighting its low‑latency, zero‑copy advantages over traditional TCP/IP.

Deepin Linux

In today's data‑intensive era, efficient data transmission is critical; RDMA (Remote Direct Memory Access) enables direct memory access between hosts, bypassing the CPU and kernel, thus reducing latency and CPU load.

1. RDMA Overview

RDMA allows computers to read and write remote memory without involving the operating system's network stack, avoiding the multiple memory copies and context switches that make traditional TCP/IP communication inefficient.

2. Core Principles

The technology relies on kernel bypass, zero copy, and CPU offload. A set of Verbs APIs lets applications interact with the RNIC hardware directly, using memory registration, queue pairs (QPs), and completion queues (CQs) to orchestrate data transfers.

3. Programming Model

RDMA operations are divided into memory verbs (read, write, atomic) and messaging verbs (send, receive). Typical usage involves opening the device, creating a protection domain, registering memory with ibv_reg_mr, creating QPs and CQs, and posting work requests via ibv_post_send and ibv_post_recv. A simplified example:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Global variables
struct ibv_context *ctx;
struct ibv_pd *pd;
struct ibv_mr *mr;
struct ibv_qp *qp;
struct ibv_cq *cq;
uint64_t remote_addr;  // peer buffer address, exchanged out of band
uint32_t remote_rkey;  // peer rkey, exchanged out of band

// Initialize RDMA resources
void init_rdma() {
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list) { perror("Failed to get RDMA device list"); exit(1); }
    ctx = ibv_open_device(dev_list[0]);
    if (!ctx) { perror("Failed to open RDMA device"); exit(1); }
    pd = ibv_alloc_pd(ctx);
    if (!pd) { perror("Failed to allocate protection domain"); exit(1); }
    char *buf = malloc(1024);
    if (!buf) { perror("Failed to allocate buffer"); exit(1); }
    mr = ibv_reg_mr(pd, buf, 1024,
                    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("Failed to register memory region"); exit(1); }
    cq = ibv_create_cq(ctx, 10, NULL, NULL, 0);
    if (!cq) { perror("Failed to create completion queue"); exit(1); }
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {.max_send_wr = 10, .max_recv_wr = 10, .max_send_sge = 1, .max_recv_sge = 1},
        .qp_type = IBV_QPT_RC
    };
    qp = ibv_create_qp(pd, &qp_attr);
    if (!qp) { perror("Failed to create queue pair"); exit(1); }
    // Note: a new QP starts in the RESET state; before posting work requests
    // it must be moved through INIT, RTR, and RTS with ibv_modify_qp (omitted here).
}

// Send data
void send_data() {
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;
    memset(&wr, 0, sizeof(wr));
    memset(&sge, 0, sizeof(sge));
    sge.addr = (uint64_t)(uintptr_t)mr->addr;
    sge.length = 1024;
    sge.lkey = mr->lkey;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = remote_rkey;
    if (ibv_post_send(qp, &wr, &bad_wr)) { perror("Failed to post send"); exit(1); }
}

// Receive data
void receive_data() {
    struct ibv_recv_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;
    memset(&wr, 0, sizeof(wr));
    memset(&sge, 0, sizeof(sge));
    sge.addr = (uint64_t)(uintptr_t)mr->addr;
    sge.length = 1024;
    sge.lkey = mr->lkey;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    if (ibv_post_recv(qp, &wr, &bad_wr)) { perror("Failed to post receive"); exit(1); }
}

// Poll completion queue
void poll_cq() {
    struct ibv_wc wc;
    // Drain whatever completions are currently available
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_SUCCESS) {
            printf("RDMA operation completed successfully\n");
        } else {
            // A failed wc.status does not set errno, so perror would mislead here
            fprintf(stderr, "RDMA operation failed: %s\n", ibv_wc_status_str(wc.status));
        }
    }
}

// Cleanup
void cleanup() {
    ibv_dereg_mr(mr);
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
}

An RDMA transfer can be expressed as the following sequence diagram:

@startuml
actor SenderApp as Sender
actor ReceiverApp as Receiver
participant SenderHW
participant ReceiverHW
participant SendQueue
participant ReceiveQueue
participant CompletionQueue

Sender -> SendQueue: submit WQE
SendQueue -> SenderHW: notify task
SenderHW -> SenderHW: read data
SenderHW -> ReceiverHW: send data
ReceiverHW -> ReceiverHW: write data
ReceiverHW -> ReceiveQueue: post CQE
Receiver -> ReceiveQueue: poll for completion
SenderHW -> CompletionQueue: post CQE
Sender -> CompletionQueue: poll for completion
@enduml

4. Application Domains

RDMA is widely used in data-center networks, cloud storage, distributed databases, high-performance computing (e.g., weather simulation, genome sequencing), and distributed storage systems such as Ceph, where it provides high bandwidth and low latency.

5. Communication Process

An RDMA session involves establishing QPs, exchanging memory keys, and then performing one-sided reads/writes or two-sided send/receive without kernel intervention, dramatically improving throughput.

6. Conclusion

RDMA addresses the latency, bandwidth, and CPU-overhead challenges of traditional networking. Despite current hurdles such as hardware cost and network stability, ongoing advances promise broader adoption across cloud, HPC, and storage workloads.

Tags: High Performance Computing, Zero Copy, Network Programming, RDMA, Data Center
Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
