
Design and Implementation of RocketMQ Tiered Storage

The article explains how RocketMQ 5.1.0 introduces a tiered storage module that offloads messages to cheaper media, describes its design, architecture layers, quick‑start configuration, upload and read mechanisms, prefetch cache, fault recovery, current development plans, and remaining challenges.


With the official release of RocketMQ 5.1.0, tiered storage has become an independent module that has reached the Technical Preview milestone, allowing users to offload messages from local disks to cheaper storage media and extend message retention at lower cost.

Design Overview

Tiered storage aims to move data to other storage media without affecting reads and writes of hot data. It serves two scenarios:

Cold–hot data separation. Newly produced messages are cached in the page cache as hot data; once hot data outgrows available memory, it is swapped out and becomes cold data. Reading that cold data back from the local disk competes with hot-path I/O, which tiered storage avoids by serving cold reads from a separate medium.

Extending message retention. Offloading messages to larger, cheaper media allows a longer TTL to be configured per topic.

Quick Start

To enable tiered storage, users only need to modify two broker configurations:

Set messageStorePlugIn to org.apache.rocketmq.tieredstore.TieredMessageStore.

Configure the storage backend, e.g., set tieredBackendServiceProvider to org.apache.rocketmq.tieredstore.provider.posix.PosixFileSegment and specify tieredStoreFilepath as the path on the new storage medium.

Optional: change tieredMetadataServiceProvider to switch the metadata store (the default is a JSON file).
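Putting the settings above together, a minimal broker.conf fragment might look like this (the storage path is a placeholder to adapt to your environment):

```properties
# broker.conf — enable tiered storage
messageStorePlugIn=org.apache.rocketmq.tieredstore.TieredMessageStore
tieredBackendServiceProvider=org.apache.rocketmq.tieredstore.provider.posix.PosixFileSegment
tieredStoreFilepath=/path/to/tiered/store
```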

Technical Architecture

Access Layer: TieredMessageStore, TieredDispatcher, and TieredMessageFetcher implement part of the MessageStore interface with asynchronous semantics and performance optimizations such as dedicated thread pools and a pre-read cache.

Container Layer: TieredCommitLog, TieredConsumeQueue, TieredIndexFile, and TieredFileQueue mirror the logical file abstractions of the default store, except that the CommitLog is organized per queue rather than per broker.

Driver Layer: TieredFileSegment maps logical files to physical files and delegates I/O to implementations of TieredStoreProvider (e.g., POSIX, S3, OSS, MinIO).
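To make the layering concrete, here is a self-contained sketch of the driver-layer idea: a logical segment delegating physical reads to a pluggable backend. The class names echo the article, but the implementation is illustrative, not RocketMQ's actual code.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

// Illustrative provider interface: implementations would back it with a
// POSIX file, S3, OSS, MinIO, etc.
interface StoreProvider {
    // Read `length` bytes starting at `position` of the physical file.
    CompletableFuture<ByteBuffer> read0(long position, int length);
}

// An in-memory backend standing in for a real storage medium.
class InMemoryProvider implements StoreProvider {
    private final byte[] data;
    InMemoryProvider(byte[] data) { this.data = data; }
    @Override
    public CompletableFuture<ByteBuffer> read0(long position, int length) {
        ByteBuffer buf = ByteBuffer.allocate(length);
        buf.put(data, (int) position, length);
        buf.flip();
        return CompletableFuture.completedFuture(buf);
    }
}

// Logical file segment: translates a logical offset into a position
// inside the physical file and delegates the actual I/O.
class FileSegment {
    private final StoreProvider provider;
    private final long baseOffset; // logical offset where this segment starts
    FileSegment(StoreProvider provider, long baseOffset) {
        this.provider = provider;
        this.baseOffset = baseOffset;
    }
    CompletableFuture<ByteBuffer> read(long logicalOffset, int length) {
        return provider.read0(logicalOffset - baseOffset, length);
    }
}
```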

Message Upload

Uploads are driven by the dispatch mechanism. When the broker starts, TieredDispatcher registers itself as a CommitLog dispatcher. Each incoming message is written to an upload buffer and the call returns immediately, so construction of the local ConsumeQueue is never blocked.

Upload progress is controlled by two offsets per queue:

dispatch offset: the position up to which messages have been written to the upload buffer but not yet uploaded.

commit offset: the position up to which messages have already been uploaded.

These offsets are persisted as metadata per topic/queue/file segment and recovered after broker restart.
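A minimal sketch of how the two offsets interact per queue (illustrative names, not the real RocketMQ internals): dispatching buffers the message and returns, while a separate commit step drains the buffer and advances the commit offset.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative per-queue upload pipeline: messages enter a buffer
// (advancing the dispatch offset) and are later uploaded in batches
// (advancing the commit offset).
class QueueUploader {
    private long dispatchOffset; // written to the buffer, not yet uploaded
    private long commitOffset;   // already uploaded to tiered storage
    private final Queue<String> buffer = new ArrayDeque<>();

    // Called from the CommitLog dispatcher: buffer the message and return
    // immediately so local ConsumeQueue construction is never blocked.
    void dispatch(String message) {
        buffer.add(message);
        dispatchOffset++;
    }

    // Background upload: drain the buffer and advance the commit offset.
    int commit() {
        int uploaded = 0;
        while (!buffer.isEmpty()) {
            buffer.poll(); // the actual upload to the backend would happen here
            uploaded++;
        }
        commitOffset += uploaded;
        return uploaded;
    }

    // Messages buffered but not yet safely uploaded.
    long lag() { return dispatchOffset - commitOffset; }
}
```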

Message Read

TieredMessageStore implements read interfaces with four strategies controlled by tieredStorageLevel :

DISABLE – reads from tiered storage are never attempted.

NOT_IN_DISK – read from tiered storage only messages that are no longer in the default store on local disk.

NOT_IN_MEM – read from tiered storage any cold data, i.e., messages no longer resident in the page cache.

FORCE – force all reads to go through tiered storage (intended for testing only).

When a read request reaches tiered storage, TieredMessageFetcher validates parameters, converts the logical queue offset to a physical file offset via TieredConsumeQueue / TieredCommitLog , and reads the data from the appropriate TieredFileSegment .
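The four strategies can be summarized as a small decision function. The predicate shape below is a sketch under the semantics described above, not RocketMQ's actual code:

```java
// Levels as named in the article.
enum TieredStorageLevel { DISABLE, NOT_IN_DISK, NOT_IN_MEM, FORCE }

class ReadRouter {
    // Decide whether a fetch should be served from tiered storage, given
    // whether the requested offset is still on the local disk and whether
    // it is still resident in the page cache.
    static boolean useTieredStore(TieredStorageLevel level,
                                  boolean inLocalDisk, boolean inPageCache) {
        switch (level) {
            case DISABLE:     return false;          // never
            case NOT_IN_DISK: return !inLocalDisk;   // fell off local disk
            case NOT_IN_MEM:  return !inPageCache;   // cold data
            case FORCE:       return true;           // testing only
            default:          return false;
        }
    }
}
```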

Prefetch Cache

During reads, a portion of the messages is prefetched into a cache to serve subsequent requests. The cache sizing algorithm is inspired by TCP Tahoe congestion control: additive increase (grow the prefetch window by one batch) and multiplicative decrease (halve the window when cached data expires without being consumed). The cache is shared among all consumer groups of a topic, and an entry becomes eligible for eviction only after every group has read it or the cache timeout has elapsed.
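The additive-increase/multiplicative-decrease behavior can be sketched as follows; the batch size and window cap are illustrative parameters, not RocketMQ defaults:

```java
// Sketch of the TCP-Tahoe-style prefetch window described above:
// grow by one batch on a hit, halve when prefetched data expires unread.
class PrefetchWindow {
    private final int batchSize;  // messages fetched per batch
    private final int maxBatches; // upper bound on the window
    private int batches = 1;      // current window, in batches

    PrefetchWindow(int batchSize, int maxBatches) {
        this.batchSize = batchSize;
        this.maxBatches = maxBatches;
    }

    int sizeInMessages() { return batches * batchSize; }

    // Consumers kept up with the prefetched data: additive increase.
    void onHit() { batches = Math.min(batches + 1, maxBatches); }

    // Prefetched data expired without being consumed: multiplicative decrease.
    void onExpire() { batches = Math.max(batches / 2, 1); }
}
```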

Fault Recovery

Metadata for each topic/queue/file segment stores the commit and dispatch offsets. After a broker restart, these offsets are restored, allowing the system to resume uploads without losing buffered messages.

Development Plan & Challenges

Tiered storage aims to leverage cloud‑native object storage for low‑cost, long‑term retention and to benefit from shared storage in multi‑replica architectures. Current challenges include:

Metadata synchronization across nodes and handling missing metadata during slave promotion.

Preventing uploads beyond the confirm offset to avoid message rollback.

Rapidly starting tiered storage on slave promotion when only the master has write permission.

Future work also plans to add server‑side tag filtering and improve cache eviction for broadcast consumption and groups with divergent consumption rates.

The key asynchronous interfaces referenced in this article:

```java
/**
 * Asynchronous get message
 * @see #getMessage(String, String, int, long, int, MessageFilter) getMessage
 *
 * @param group Consumer group that launches this query.
 * @param topic Topic to query.
 * @param queueId Queue ID to query.
 * @param offset Logical offset to start from.
 * @param maxMsgNums Maximum count of messages to query.
 * @param messageFilter Message filter used to screen desired messages.
 * @return Matched messages.
 */
CompletableFuture<GetMessageResult> getMessageAsync(final String group, final String topic,
    final int queueId, final long offset, final int maxMsgNums,
    final MessageFilter messageFilter);

// TieredMessageFetcher#getMessageAsync is similar to TieredMessageStore#getMessageAsync
public CompletableFuture<GetMessageResult> getMessageAsync(String group, String topic,
        int queueId, long queueOffset, int maxMsgNums, final MessageFilter messageFilter)

/**
 * Get data from the backend file system
 *
 * @param position the index from which the file will be read
 * @param length the size of the data to read
 * @return the data that was read
 */
CompletableFuture<ByteBuffer> read0(long position, int length);
```

References

For more details, see the RocketMQ tiered storage README.
Tags: Cloud Native, Backend Development, Message Queue, RocketMQ, Tiered Storage
Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.
