Master Election and Load Balancing in WMB Distributed Message Queue
The article explains how WMB, a high‑performance distributed message queue, uses Paxos‑based master election, periodic lease renewal, and a multi‑stage load‑balancing strategy to evenly distribute master nodes across groups, improve throughput, and ensure consistent message delivery.
WMB is a self‑developed distributed message queue featuring high reliability, availability, low latency, and high throughput. By integrating the Paxos consistency algorithm, it achieves strong master‑slave data consistency and rapid automatic failover.
To increase throughput, WMB partitions data into Paxos groups and batches proposals, submitting a batch once a time or size threshold is reached. Each topic's clients send data only to the master of the topic's group, and consumers pull from that same master. Because all traffic for a group flows through its master, an uneven master distribution concentrates load on a few nodes and creates a performance bottleneck. WMB therefore provides a master load-balancing strategy that periodically re-elects masters to even out the distribution.
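The batching rule (flush on a size threshold or a time threshold, whichever comes first) can be sketched as follows. This is a minimal illustration, not WMB's code: the class name `ProposalBatcher`, the `propose` callback, and the default thresholds are all assumptions made for the example.

```python
import time
from typing import Callable, List, Optional


class ProposalBatcher:
    """Hypothetical helper: buffers messages and hands them to `propose` as one batch."""

    def __init__(self, propose: Callable[[List[bytes]], None],
                 max_bytes: int = 64 * 1024, max_wait_s: float = 0.005,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._propose = propose          # stands in for one Paxos propose call
        self._max_bytes = max_bytes      # size threshold (assumed value)
        self._max_wait_s = max_wait_s    # time threshold (assumed value)
        self._clock = clock
        self._pending: List[bytes] = []
        self._pending_bytes = 0
        self._first_at: Optional[float] = None

    def add(self, message: bytes) -> None:
        if not self._pending:
            self._first_at = self._clock()   # remember when the batch was opened
        self._pending.append(message)
        self._pending_bytes += len(message)
        if self._pending_bytes >= self._max_bytes:
            self.flush()                     # size threshold reached

    def tick(self) -> None:
        """Call periodically; flushes once the oldest pending message has waited long enough."""
        if self._pending and self._clock() - self._first_at >= self._max_wait_s:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._pending_bytes = 0
        self._first_at = None
        self._propose(batch)
```

A caller would invoke `add` for each incoming message and `tick` from a timer, so that a slow trickle of messages is still proposed within `max_wait_s`.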
The master election process extends the basic Paxos algorithm by introducing a master role that must be unique at any time. A node may invoke the BeMaster operation only when it sees no master or when it is already the master. BeMaster starts a lease timer at time T1, and the lease expires at T3 = T1 + timeout, so no other node can pre-empt the master before T3.
Master lease renewal is simple: the current master periodically performs BeMaster before its lease expires, which extends the lease. Every node runs the same periodic check, but attempts BeMaster only when it sees no master or is itself the master.
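The election and renewal rules can be sketched as a small lease record. This is a simplified, single-process illustration: the Paxos round that would actually agree on the BeMaster value is omitted, and the names `MasterLease`, `may_be_master`, and `be_master` are assumptions for the example rather than WMB identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MasterLease:
    """One node's view of a group's master lease (hypothetical structure)."""
    timeout: float
    master: Optional[str] = None
    expires_at: float = 0.0              # T3 in the text

    def current_master(self, now: float) -> Optional[str]:
        # A master only counts while its lease is still running.
        if self.master is not None and now < self.expires_at:
            return self.master
        return None

    def may_be_master(self, node: str, now: float) -> bool:
        # BeMaster is allowed when there is no master, or the caller already is the master.
        holder = self.current_master(now)
        return holder is None or holder == node

    def be_master(self, node: str, t1: float) -> bool:
        # In a real system this would be a Paxos propose; here it is a local update.
        if not self.may_be_master(node, t1):
            return False
        self.master = node
        self.expires_at = t1 + self.timeout   # T3 = T1 + timeout
        return True


lease = MasterLease(timeout=10.0)
lease.be_master("A", t1=0.0)    # A is elected, lease runs until 10.0
lease.be_master("B", t1=5.0)    # rejected, A's lease is still running
lease.be_master("A", t1=8.0)    # renewal, lease now runs until 18.0
```

Election and renewal share the same code path, which is why the text can describe renewal as simply calling BeMaster again within the lease period.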
Load balancing aims to distribute master nodes evenly across groups. The system periodically computes the ideal master distribution, identifies the groups whose current master differs from the ideal one, and triggers a "master-steal" operation on the smallest-indexed mismatched group. Only one group undergoes stealing at a time, which preserves availability.
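The selection step can be sketched as below. The article does not say how the ideal distribution is computed, so this example assumes a simple round-robin assignment over sorted node names; the function names are likewise hypothetical.

```python
from typing import Dict, List, Optional


def ideal_masters(group_ids: List[int], nodes: List[str]) -> Dict[int, str]:
    """Assumed policy: assign groups to nodes round-robin, in sorted order."""
    if not nodes:
        raise ValueError("at least one node is required")
    ordered = sorted(nodes)
    return {g: ordered[i % len(ordered)] for i, g in enumerate(sorted(group_ids))}


def mismatched_groups(current: Dict[int, Optional[str]],
                      ideal: Dict[int, str]) -> List[int]:
    """Groups whose current master differs from the ideal one, in ascending order."""
    return sorted(g for g, want in ideal.items() if current.get(g) != want)


def next_group_to_steal(current: Dict[int, Optional[str]],
                        ideal: Dict[int, str]) -> Optional[int]:
    """Only the smallest-indexed mismatched group is handled in each round."""
    bad = mismatched_groups(current, ideal)
    return bad[0] if bad else None


# All four groups currently have node "a" as master.
current = {0: "a", 1: "a", 2: "a", 3: "a"}
ideal = ideal_masters([0, 1, 2, 3], ["a", "b"])
print(mismatched_groups(current, ideal))     # [1, 3]
print(next_group_to_steal(current, ideal))   # 1
```

Because every node can compute the same ideal map from the same inputs, they agree on which group is next without extra coordination, provided the assignment function is deterministic.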
The steal mechanism evolved through three schemes:
Scheme 1: The current master voluntarily stops renewing its lease, allowing the ideal node to become master after the lease expires.
Scheme 2: The master’s lease expiration time is advanced via a Paxos propose, shortening the waiting period.
Scheme 3 (final): The master's lease expiration is advanced to the current time, so the lease expires immediately, and the ideal node then performs BeMaster at once. Lease advance and master steal are merged into a single propose.
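The difference between the three schemes is how long the group waits before the ideal node can take over. The sketch below is illustrative only: `handover_wait`, `steal`, and `LeaseState` are hypothetical names, and the model ignores network and Paxos round-trip time.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LeaseState:
    master: Optional[str]
    expires_at: float


def handover_wait(scheme: int, now: float, expires_at: float,
                  advanced_to: Optional[float] = None) -> float:
    """Time until the old lease is over, under each scheme (simplified model)."""
    if scheme == 1:                       # stop renewing, wait out the full lease
        return max(0.0, expires_at - now)
    if scheme == 2:                       # expiration moved earlier by a propose
        if advanced_to is None:
            raise ValueError("scheme 2 needs the advanced expiration time")
        return max(0.0, min(expires_at, advanced_to) - now)
    if scheme == 3:                       # expiration moved to "now"
        return 0.0
    raise ValueError("unknown scheme")


def steal(state: LeaseState, new_master: str, now: float,
          timeout: float) -> Tuple[LeaseState, float]:
    """Scheme 3 as one transition, standing in for the single merged propose."""
    old_term_end = min(state.expires_at, now)   # lease advance: old term ends now
    # BeMaster: no live lease remains at `now`, so the new term starts right away.
    return LeaseState(new_master, now + timeout), old_term_end


print(handover_wait(1, now=2.0, expires_at=10.0))                   # 8.0
print(handover_wait(2, now=2.0, expires_at=10.0, advanced_to=4.0))  # 2.0
print(handover_wait(3, now=2.0, expires_at=10.0))                   # 0.0
```

Treating the lease advance and the new BeMaster as one value means there is no interval in which the old lease has ended but no successor has been chosen.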
The overall workflow starts with each service node fetching the group-to-master map from the registry, computing the set of mismatched groups, and initiating the steal process for the smallest-indexed one. Preparation steps ensure that consumer offsets are synchronized before the steal, and producers forward messages to the upcoming master so that the transition is seamless.
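The offset-synchronization precondition might look like the check below. The article does not describe how offsets are stored or compared, so the dictionary layout (consumer name to committed offset) and both function names are assumptions made for illustration.

```python
from typing import Dict


def lagging_consumers(master_offsets: Dict[str, int],
                      candidate_offsets: Dict[str, int]) -> Dict[str, int]:
    """Consumers whose offset on the upcoming master trails the current master, with the gap."""
    lag = {}
    for consumer, offset in master_offsets.items():
        behind = offset - candidate_offsets.get(consumer, 0)
        if behind > 0:
            lag[consumer] = behind
    return lag


def ready_to_steal(master_offsets: Dict[str, int],
                   candidate_offsets: Dict[str, int]) -> bool:
    """Hypothetical gate: steal only once the upcoming master has every consumer offset."""
    return not lagging_consumers(master_offsets, candidate_offsets)


on_master = {"billing": 120, "search": 75}
on_candidate = {"billing": 120, "search": 70}
print(lagging_consumers(on_master, on_candidate))   # {'search': 5}
print(ready_to_steal(on_master, on_candidate))      # False
```

Gating the steal on such a check is one way to keep consumers from re-reading or skipping messages when they switch to the new master.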
Performance tests show that a balanced master distribution yields significantly higher QPS across various message sizes compared to a scenario where all masters reside on a single node.
In summary, the article presents the design and iterative optimization of WMB’s master load‑balancing mechanism, demonstrating how engineering‑level enhancements to Paxos meet the specific requirements of a high‑throughput distributed message system.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.