Design and Optimization of Alibaba's Notify and MetaQ Distributed Message Middleware

This article explains how Alibaba's Notify and MetaQ middleware achieve high‑performance, reliable distributed messaging and eventual consistency for massive e‑commerce transaction volumes. It covers their architecture, design principles, scalability, and fault tolerance, as well as the specific optimizations applied for the Double‑11 shopping festival.

Architect

Aug 6, 2015

Message Middleware – The Broadcaster of Distributed Messages

Message middleware is a typical middleware technology that uses message‑passing mechanisms or queue models to provide reliable asynchronous communication between applications or components. It reduces coupling and improves system scalability and availability.

3.1 Notify

Notify is a messaging engine developed in‑house at Taobao. It is a core part of the Double‑11 system and is widely used in Taobao and Alipay transaction scenarios. Its core functions are decoupling, asynchrony, and parallelism.

Consider a user‑registration flow with ten steps (user DB write, red‑packet service, Alipay account creation, SNS notification, and so on). Executing the steps serially leads to long latency. Executing them in parallel reduces latency but introduces a new problem: registration cannot finish until the slowest parallel task completes, which can still cause long user‑visible delays and raises the risk of failure.

The solution is to treat the registration write itself as the only step the user has to wait for. The subsequent tasks are performed asynchronously through a message‑queue system, which brings the overall state to eventual consistency.
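The idea can be illustrated with a minimal Python sketch. It is not Notify's API: the in-process `queue.Queue` stands in for the message system, and the names `register_user`, `grant_red_packet`, and `create_payment_account` are hypothetical.

```python
import queue
import threading

# Hypothetical in-process stand-in for a message broker such as Notify.
events = queue.Queue()
users = {}          # stands in for the user database
side_effects = []   # records what the asynchronous handlers did


def register_user(user_id, name):
    """Critical path: write the user record, publish an event, return."""
    users[user_id] = name
    events.put({"type": "user_registered", "user_id": user_id})
    return True


def grant_red_packet(event):
    side_effects.append(("red_packet", event["user_id"]))


def create_payment_account(event):
    side_effects.append(("payment_account", event["user_id"]))


# Follow-up tasks that the user does not have to wait for.
HANDLERS = [grant_red_packet, create_payment_account]


def consume_forever():
    """Background consumer: runs follow-up tasks off the critical path."""
    while True:
        event = events.get()
        if event is None:          # sentinel: stop the worker
            events.task_done()
            break
        for handler in HANDLERS:
            handler(event)
        events.task_done()


def run_demo():
    worker = threading.Thread(target=consume_forever, daemon=True)
    worker.start()
    ok = register_user(42, "alice")  # returns before the handlers run
    events.join()                    # demo only: wait for the follow-up work
    events.put(None)
    worker.join()
    return ok, sorted(side_effects)
```

In this sketch `register_user` returns as soon as the event is queued, so the user-visible latency depends only on the database write and the publish call. A real deployment would also need durable storage and redelivery, which the in-memory queue does not provide.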

Core Principles of Notify

Notify differs from traditional MQ in two main ways:

It is designed for message accumulation.

It has no single point of failure, and its architecture scales freely.

Most commercial MQ products focus on point‑to‑point transmission and aggressively use memory for performance, which does not suit large‑scale distributed scenarios with many consumers and unstable back‑ends. Notify instead persists messages to disk before asynchronous delivery, sacrificing peak throughput for stability and reliability.
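A store-then-deliver broker can be sketched roughly as follows. This is an illustration under assumptions, not Notify's implementation: SQLite stands in for the storage cluster, and the class `StoreFirstBroker` and its methods are hypothetical names.

```python
import sqlite3


class StoreFirstBroker:
    """Toy broker that persists every message before attempting delivery."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to keep messages on disk.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "body TEXT NOT NULL, "
            "delivered INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def send(self, body):
        """Persist the message, then acknowledge the producer with its id."""
        cur = self.db.execute("INSERT INTO messages (body) VALUES (?)", (body,))
        self.db.commit()
        return cur.lastrowid

    def backlog(self):
        """Number of stored messages that have not been delivered yet."""
        row = self.db.execute(
            "SELECT COUNT(*) FROM messages WHERE delivered = 0"
        ).fetchone()
        return row[0]

    def deliver_pending(self, consumer):
        """Try each undelivered message once; failures stay in the backlog."""
        rows = self.db.execute(
            "SELECT id, body FROM messages WHERE delivered = 0 ORDER BY id"
        ).fetchall()
        for msg_id, body in rows:
            try:
                consumer(body)
            except Exception:
                continue  # consumer is unavailable: retry on the next pass
            self.db.execute(
                "UPDATE messages SET delivered = 1 WHERE id = ?", (msg_id,)
            )
        self.db.commit()
```

Because `send` returns only after the commit, a slow or failing consumer grows the backlog instead of blocking producers or losing messages, which is the trade-off described above.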

Notify’s architecture consists of five core components:

Message‑sending cluster (stateless application machines that can scale up or down).

Configuration server cluster (detects node up/down events and broadcasts changes).

Notify server cluster (handles actual message sending/receiving; any server can fail without affecting the system).

Storage cluster (supports both multi‑copy disk storage for high safety and memory storage for high throughput, all stateless).

Message‑receiving cluster (business‑side consumers that also scale dynamically).
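The interaction between the configuration servers and the stateless sending machines might look like the sketch below. It is a simplification under assumptions: `ConfigServer` and `Sender` are hypothetical classes, membership changes are reported by direct method calls rather than by failure detection, and round-robin selection is only one possible routing policy.

```python
class ConfigServer:
    """Tracks live Notify servers and broadcasts every membership change."""

    def __init__(self):
        self.live = set()
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)
        callback(frozenset(self.live))  # push the current view immediately

    def _broadcast(self):
        snapshot = frozenset(self.live)
        for callback in self.subscribers:
            callback(snapshot)

    def node_up(self, address):
        self.live.add(address)
        self._broadcast()

    def node_down(self, address):
        self.live.discard(address)
        self._broadcast()


class Sender:
    """Stateless sender: keeps only the latest server list it was told about."""

    def __init__(self, config):
        self.servers = []
        self._next = 0
        config.subscribe(self._on_change)

    def _on_change(self, live):
        self.servers = sorted(live)

    def pick_server(self):
        """Round-robin over whichever servers are currently live."""
        if not self.servers:
            raise RuntimeError("no live Notify server")
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server
```

Since a sender holds no state beyond the broadcast server list, any sender can be added or removed, and a failed server simply disappears from the rotation after the next broadcast.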

3.2 Notify Preparation and Optimization for Double‑11

During Double‑11, Notify handled massive traffic, writing 600 k messages per second and accumulating a backlog of 450 million messages without degradation. The system remained stable and met all delivery requirements.

3.3 MetaQ

MetaQ is a Java‑based, queue‑model message middleware that evolved from Kafka and was open‑sourced in 2012. It provides a highly reliable persistent disk queue that relies on the OS cache for performance. It supports distributed producers and consumers, strict message ordering, a rich set of pull modes, and backlogs of billions of messages.

MetaQ’s redesigned storage layer supports tens of thousands of queues per machine. It uses a two‑file approach: one file receives all messages as sequential writes, and another indexes where each message is located. This keeps writes sequential for reliability while still allowing a high queue count.
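A toy version of this layout is sketched below. It is not MetaQ's on-disk format: the class `TwoFileStore`, the file names, the one-index-file-per-queue layout, and the 12-byte index entry are all assumptions made for illustration.

```python
import os
import struct
import tempfile

# One index entry: (byte offset in the shared log, message length).
INDEX_ENTRY = struct.Struct(">QI")


class TwoFileStore:
    """All queues append to one shared log; each queue keeps a small index."""

    def __init__(self, directory):
        self.directory = directory
        self.log_path = os.path.join(directory, "commit.log")
        open(self.log_path, "ab").close()  # make sure the log file exists

    def _index_path(self, queue_name):
        return os.path.join(self.directory, f"{queue_name}.idx")

    def append(self, queue_name, payload):
        """Write the payload sequentially, then record where it landed."""
        with open(self.log_path, "ab") as log:
            offset = log.seek(0, os.SEEK_END)
            log.write(payload)
        with open(self._index_path(queue_name), "ab") as idx:
            position = idx.seek(0, os.SEEK_END) // INDEX_ENTRY.size
            idx.write(INDEX_ENTRY.pack(offset, len(payload)))
        return position  # logical position of the message within its queue

    def read(self, queue_name, position):
        """Look up the index entry, then fetch the bytes from the shared log."""
        with open(self._index_path(queue_name), "rb") as idx:
            idx.seek(position * INDEX_ENTRY.size)
            offset, length = INDEX_ENTRY.unpack(idx.read(INDEX_ENTRY.size))
        with open(self.log_path, "rb") as log:
            log.seek(offset)
            return log.read(length)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = TwoFileStore(directory)
        store.append("orders", b"o1")
        store.append("payments", b"pay-1")
        store.append("orders", b"order-2")
        return store.read("orders", 1), store.read("payments", 0)
```

Adding a queue in this sketch costs only a small fixed-size index, while message bodies from every queue share one sequentially written file, which is why the queue count can grow without multiplying random disk writes.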

MetaQ vs. Notify

Notify focuses on transaction messages and distributed‑transaction scenarios, while MetaQ targets ordered messages such as binlog synchronization and pull‑based use cases like stream processing.
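The pull-based, ordered consumption style on the MetaQ side can be sketched as follows. This is a hypothetical model, not the MetaQ client API: `OrderedQueue` and `PullConsumer` are invented names, and the queue lives in memory.

```python
class OrderedQueue:
    """Append-only queue: messages are read back in the order they were written."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def pull(self, offset, max_count):
        """Return up to max_count messages starting at the given offset."""
        return self._messages[offset:offset + max_count]


class PullConsumer:
    """The consumer, not the broker, decides when to fetch and tracks its offset."""

    def __init__(self, source, batch_size=2):
        self.source = source
        self.batch_size = batch_size
        self.offset = 0

    def poll(self):
        batch = self.source.pull(self.offset, self.batch_size)
        self.offset += len(batch)  # advance only past what was actually read
        return batch
```

For a binlog-style stream, a consumer that polls `INSERT`, `UPDATE`, and `DELETE` events from one such queue sees them in write order, and a slow consumer simply falls behind at its own offset without affecting the producer.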

3.4 MetaQ Preparation and Optimization for Double‑11

To reduce latency in the Taobao live‑stream trading system, several optimizations were applied: MySQL instance scaling, separating binlog storage from data storage, and tuning binlog generation to eliminate delays.

Cluster‑specific parameter tuning (batch sizes, flush strategies, data TTL, I/O scheduling, virtual memory) further improved throughput and message accumulation handling.

Real‑time monitoring and alerting were implemented, with dynamic adjustments via Diamond, ensuring rapid fault detection and high availability.

During Double‑11, the system processed 11.2 billion messages from Taobao and 2.4 billion from Alipay. Peak write rates reached 13.1 k messages per second and peak delivery rates reached 27.8 k per second, with sub‑second latency for live‑stream trading and zero overselling.

Conclusion

Alibaba’s distributed message middleware now serves over 500 business applications across the group and handles more than 350 billion messages daily. With its high reliability, performance, and distributed‑transaction support, it is one of the most mature middleware products in the company.