
Optimizing NVMe SSD Tiered Storage for Advertising Retrieval: Methodology and Read Scheduling

This article presents a tiered NVMe SSD storage solution for ad-ranking retrieval. It combines zone-aware hardware, Direct I/O with Libaio/IoUring, and an adaptive pipeline scheduler to cut long-tail read latency and raise throughput, enabling larger-than-memory signals and improving ranking accuracy and revenue.

Baidu Geek Talk

This article discusses the need for larger storage capacity in advertising retrieval systems, where the volume of signals required at the coarse‑ranking stage far exceeds memory limits. To address this, a tiered storage solution based on NVMe SSDs is proposed.

Business Background: The ad-ranking service stores billions of ad materials and query contexts in memory. Introducing URL plaintext dramatically increases memory usage, pushing per-instance memory consumption beyond cloud-native quota limits. Consequently, a scalable storage layer is required.

Technical Background: SSDs suffer from uncontrolled read/write interference and write amplification due to page-level I/O and erase-before-write constraints. Traditional software optimizations (e.g., aligned writes, bulk deletions) only partially mitigate long-tail latency.
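As a rough illustration of why sub-page updates inflate device writes, consider the sketch below. The geometry constant and helper are hypothetical, for illustration only, and are not from the article:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical NAND geometry for illustration (not from the article).
constexpr std::size_t kPageSize = 4096;  // smallest programmable unit

// A logical update costs at least one full page program; if garbage
// collection must first move still-valid pages out of the erase block,
// those extra programs count too. Returns the write-amplification factor.
double write_amplification(std::size_t logical_bytes,
                           std::size_t valid_pages_moved) {
    std::size_t logical_pages = (logical_bytes + kPageSize - 1) / kPageSize;
    std::size_t pages_programmed = logical_pages + valid_pages_moved;
    return static_cast<double>(pages_programmed) / logical_pages;
}
```

Under this model, a 512-byte update that forces ten valid pages to be relocated costs eleven page programs, a write amplification of 11x, which is the kind of background cost that bleeds into read-latency tails.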

New hardware standards such as OpenChannel and Zoned Namespace SSD (ZNS) expose zone‑level interfaces, allowing finer‑grained control over data placement and command scheduling, which can reduce read‑latency tails.

Solution Overview (Ecomm Uniform SSD Layer – SsdEngine): The project integrates various hardware (NVMe, ZNS) behind a unified interface tailored for advertising workloads. It focuses on controlling the long-tail of read latency while maintaining high throughput.

NVMe Long-Tail Control: By fixing the read unit size (ValueSize = 4 KB) and tuning the read/write throughput ratio (IOPS pattern), long-tail read latency can be effectively managed. Benchmarks on an Intel P3600 identify the read/write ratios that keep latency within a 5 ms target.
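A minimal sketch of why a fixed 4 KB ValueSize helps: every value maps to exactly one aligned page, so a read never straddles two flash pages and an update never triggers a read-modify-write of neighbouring values. The helper names are ours, not SsdEngine's:

```cpp
#include <cassert>
#include <cstdint>

// Fixed read unit from the article; every value occupies one page.
constexpr uint64_t kValueSize = 4096;

// The i-th value starts at a page-aligned offset by construction.
constexpr uint64_t page_offset_of(uint64_t value_index) {
    return value_index * kValueSize;
}

// Alignment check: true iff the offset sits on a 4 KB page boundary.
constexpr bool is_page_aligned(uint64_t offset) {
    return (offset & (kValueSize - 1)) == 0;
}
```

Because every offset produced this way is aligned, each lookup costs exactly one page-sized device read, which is what makes the IOPS pattern predictable enough to tune the read/write ratio against.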

Direct I/O and Asynchronous Access: Switching from PageCache to Direct I/O (DIO) gives hardware-level control. To compensate for the loss of cache benefits, Libaio and IoUring are employed for asynchronous I/O. The article provides a concrete Libaio example:

io_context_t context = 0;
io_queue_init(iodepth, &context);   // create an AIO context with `iodepth` slots
for (int i = 0; i < nr; i++) {
    // Arguments: iocb, fd, buffer, byte count, file offset (4 KB aligned).
    io_prep_pread(iocb_list[i], fd, page[i], page_size[i], page_offset[i]);
}
io_submit(context, nr, iocb_list);  // batch-submit all reads in one syscall
while (completed < nr) {
    // Block for at least min_wait_nr completions, or until the timeout ts.
    int res = io_getevents(context, min_wait_nr, nr - completed, events, &ts);
    for (int i = 0; i < res; i++) {
        // Completions may arrive out of order; events[i].obj points back at
        // the iocb, and events[i].res holds the byte count actually read.
        assert(events[i].res == (long long)events[i].obj->u.c.nbytes);
    }
    completed += res;
    if (total_time_cost > pv_timeout) {
        // Past the request deadline: cancel each still-pending iocb.
        io_cancel(context, pending_iocb, &cancel_event);
        break;
    }
}

Similarly, an IoUring integration example is shown:

struct io_uring ring;
io_uring_queue_init(iodepth, &ring, 0 /*flags*/);
for (int i = 0; i < nr; i++) {
    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);  // grab a free SQE
    io_uring_prep_readv(sqe, fd, &iovecs[i], 1, page_offset[i]);
    io_uring_sqe_set_data(sqe, &iovecs[i]);  // tag for matching on completion
}
io_uring_submit(&ring);  // one syscall submits the whole batch
while (completed < nr) {
    // Wait for at least wait_nr completions, or until the timeout ts.
    io_uring_wait_cqes(&ring, &wait_cqe, wait_nr, &ts, nullptr);
    unsigned head, seen = 0;
    struct io_uring_cqe* cqe;
    io_uring_for_each_cqe(&ring, head, cqe) {
        seen++;
        if (cqe->user_data == LIBURING_UDATA_TIMEOUT) continue;
        process(cqe);  // cqe->res holds the byte count read
        completed++;
    }
    io_uring_cq_advance(&ring, seen);  // release all consumed CQEs at once
}

Adaptive Scheduler: The system detects the kernel version at runtime. On kernels ≤ 5.10 it uses the AioPageScheduler (Libaio); on newer kernels it prefers the UringPageScheduler (IoUring), falling back automatically.
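The selection logic can be sketched as follows. Only the two scheduler names come from the article; the parsing helper is illustrative, and in SsdEngine the release string would presumably come from uname(2) rather than a parameter:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

struct KernelVersion { int major = 0; int minor = 0; };

// Parse the leading "major.minor" out of a release string like "5.10.0-26".
KernelVersion parse_release(const std::string& release) {
    KernelVersion v;
    std::sscanf(release.c_str(), "%d.%d", &v.major, &v.minor);
    return v;
}

// Prefer io_uring only on kernels newer than 5.10, as the article describes;
// everything else falls back to the Libaio-based scheduler.
const char* pick_scheduler(const KernelVersion& v) {
    bool newer_than_5_10 = v.major > 5 || (v.major == 5 && v.minor > 10);
    return newer_than_5_10 ? "UringPageScheduler" : "AioPageScheduler";
}
```

Keeping the cutoff in one predicate makes the fallback path trivial to test on machines whose kernels straddle the boundary.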

Pipeline Design: The original schedule-submit-wait flow is transformed into a pipeline in which batch submission and early waiting overlap with hash-table lookups. Two tunable parameters, batch_submit (the number of I/O tasks submitted at once) and iodepth_low (the in-flight threshold below which the queue is refilled), balance system-call overhead against latency.
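A toy model of how the two knobs interact; the function and its accounting are our sketch, not SsdEngine code. It counts io_submit-style batch calls, the syscall overhead that a larger batch_submit reduces, while iodepth_low keeps the device queue from running dry mid-request:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Simulate the pipeline's queue management: submit up to `batch_submit`
// tasks per call, and issue the next batch as soon as the in-flight count
// drops below `iodepth_low`. Returns the number of submit calls made.
// (A real scheduler overlaps these submits with hash-table lookups.)
std::size_t run_pipeline(std::size_t total_tasks,
                         std::size_t batch_submit,
                         std::size_t iodepth_low) {
    std::size_t submitted = 0, in_flight = 0, completed = 0;
    std::size_t submit_calls = 0;
    while (completed < total_tasks) {
        // Refill the device queue whenever it falls below the low watermark.
        while (submitted < total_tasks && in_flight < iodepth_low) {
            std::size_t n = std::min(batch_submit, total_tasks - submitted);
            submitted += n;
            in_flight += n;
            ++submit_calls;
        }
        // Model one completion arriving; lookup work would overlap here.
        --in_flight;
        ++completed;
    }
    return submit_calls;
}
```

In this model the number of submit syscalls is the task count divided by batch_submit (rounded up), so doubling batch_submit halves syscall overhead, while a lower iodepth_low delays refills and trades throughput for queueing latency.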

Benchmarks on a Xeon Platinum 8350C (kernel 5.10) show that enabling the pipeline reduces average I/O latency by up to 30 % in large‑value scenarios and by ~15 % in massive‑KV scenarios. IoUring provides an additional ~8 % throughput gain over Libaio.

Application Impact: Introducing URL plaintext in the ad ranking pipeline increased QLQ accuracy by 10.8 pp and boosted revenue. The tiered storage approach enables further scaling of such signals.

Related Work: The article positions the problem within the broader context of larger-than-memory data management, referencing research on hot/cold execution paths, hierarchical storage, and LSM-tree vs. HashKV indexing strategies.

Tags: performance optimization, SSD, Ad Retrieval, NVMe, Tiered Storage, IoUring, Libaio