
Linux blk-mq Multi-queue Block Device Layer Framework and Implementation

The Linux blk-mq framework replaces the legacy single-queue block layer with a two-level queue architecture of per-CPU software staging queues and hardware dispatch queues, reducing lock contention and interrupt overhead, pre-allocating request tags, and supporting multi-queue IO schedulers to exploit the performance of high-IOPS SSDs.

OPPO Kernel Craftsman
The traditional Linux block device layer (Block Layer) and IO schedulers such as CFQ were designed for HDD storage, where hardware limits (hundreds of IOPS, millisecond-level latency) were the primary bottleneck. With the emergence of high-speed SSDs capable of millions of IOPS, however, the traditional single-queue (blk-sq) framework has itself become the performance bottleneck due to its software overhead.

To address modern high-IOPS, low-latency storage requirements, the multi-queue block device layer framework (blk-mq) was introduced. This article provides an in-depth analysis of the blk-mq framework and its code implementation.

Single-Queue Framework and Its Problems:

The traditional Linux block layer uses a single-queue architecture with one request queue per block device. It provides unified interfaces for accessing different storage devices and generic services for storage device drivers. The main functions include: bio submission and completion handling, IO request staging (merging, sorting), IO scheduling (noop, cfq, deadline), and IO accounting.

However, the single-queue design scales poorly on multi-core systems. Software overhead in blk-sq comes from three main sources: contention on the request queue lock (the q->queue_lock spinlock), hardware interrupts (high IOPS means high interrupt rates), and remote memory access (when the CPU submitting the IO is not the CPU receiving the hardware interrupt and the two share no cache). At high IOPS, approximately 80% of CPU time was spent on lock acquisition.

Multi-Queue Framework and Solutions:

Jens Axboe proposed the multi-queue (MQ) block device layer architecture (blk-mq), which uses a two-queue design to distribute lock contention across multiple queues:

Software Staging Queue: Each CPU gets its own software queue in blk-mq. Bio submission/completion, IO request staging (merging, sorting), tagging, scheduling, and accounting all happen on this per-CPU queue, eliminating contention on a single shared lock.

Hardware Dispatch Queue: blk-mq assigns one hardware dispatch queue per storage device hardware queue. During driver initialization, blk-mq maps one or more software queues to each hardware dispatch queue using a fixed mapping strategy.

With MQ architecture, only 3% of CPU time is spent on lock acquisition at high IOPS, dramatically improving IOPS throughput.

Code Analysis of blk-mq Framework:

blk-mq was merged in Linux-3.13 and became feature-complete in Linux-3.16; in Linux-5.0 the blk-sq code was removed entirely, leaving MQ as the only block layer.

Request and Tag Allocation:

In blk-mq, requests and tags are bound together. The key data structures are blk_mq_tags (describing a set of tags and requests) and blk_mq_tag_set (describing a storage device's tag sets and abstracting its IO characteristics). Unlike SQ, which allocates requests from a memory pool and assigns tags only when dispatching to the driver, MQ pre-allocates request memory during driver initialization (via blk_mq_alloc_tag_set) to avoid allocation overhead on the IO path. A tag serves as an index into the request array (static_rqs/rqs).

Request Queue Initialization:

blk_mq_init_queue initializes the IO request queue. The process includes: allocating queue memory on the device's NUMA node, initializing the two-level queue structure (per-CPU software queues and hardware dispatch queues), setting mq_ops, configuring the make_request callback, and establishing the software-to-hardware queue mappings.

IO Submission (submit):

blk_mq_make_request converts submitted bios into requests. The flow includes: attempting to merge into the current thread's plug list, attempting to merge into the current CPU's software queue, applying QoS throttling (wbt, io-latency cgroup, io-cost cgroup), allocating a request, and inserting it into the appropriate queue based on its type (fua/flush, plugged, scheduled, or direct dispatch).

IO Dispatch:

blk_mq_run_hw_queue dispatches IO requests to the block device driver and is triggered from multiple points. The dispatch flow (via __blk_mq_run_hw_queue and blk_mq_sched_dispatch_requests) includes: dispatching from the hardware queue's dispatch list, dispatching from the scheduler queue if one is configured, dispatching from the software queues in round-robin order if the device is busy, or dispatching from all mapped software queues.

IO Completion:

Taking the UFS + scsi-mq driver as an example, the completion flow is: the device raises an interrupt; the driver reads the interrupt status to identify completed requests; scsi_done enters the SCSI completion path; blk-mq completes the request through one of several paths (local softirq, remote IPI, or direct completion); and finally the bio is completed via req_bio_endio.

Multi-queue IO Schedulers:

IO scheduler support was added to blk-mq in Linux-4.11. The main MQ schedulers include:

mq-deadline: Assigns each IO request a deadline based on its type (read/write), sorts by deadline, and supports merging. The default MQ scheduler since Linux-4.11.

bfq: Budget Fair Queueing, merged in Linux-4.12, targets slow devices such as HDDs. It provides IO sorting, priorities, bandwidth allocation, and group scheduling, but its software overhead makes it unsuitable for high-speed devices.

kyber: A true MQ scheduler for high-speed devices, merged in Linux-4.12. It sets latency targets per IO type, monitors observed latency, and adjusts queue depths dynamically.
