How MOSN’s New Latency‑Based Load Balancing Cuts Tail Latency
This article explains MOSN v1.5.0's latency‑based load‑balancing algorithm, analyzes the sources of latency in distributed systems, describes mitigation techniques such as PeakEWMA and P2C, and presents a realistic simulation that shows the algorithm’s superiority over Round Robin and Least Request.
What Is Load Balancing
Load balancing is the process of distributing a set of tasks across a set of resources (computing units) to make overall processing more efficient. It optimizes response time and prevents uneven load, where some nodes are overloaded while others sit idle.
In large distributed systems, load balancing is a key component that solves two major problems: scalability and resilience.
Scalability: Applications are deployed on multiple identical replicas; additional replicas can be added when resources are insufficient, and excess replicas can be removed to save cost. Load balancing distributes request load among replicas.
Resilience: Failures in a distributed system are partial. Redundant replicas ensure service continuity, and load balancing redirects traffic away from failed nodes to healthy ones.
Why Latency Matters
Even on a single, stable server, latency arises from task complexity, transmission delay, queueing delay, and resource contention from background activity. In multi‑server deployments these latencies accumulate, and any significant increase degrades the end‑user experience. Cloud environments add further unpredictability because CPU, memory, and I/O are shared with other tenants.
How to Reduce Latency
Research shows latency in large‑scale internet services has a long‑tail distribution; reducing tail latency at each architectural layer dramatically lowers overall user‑perceived latency. In a service mesh, sidecar proxies handle all inbound and outbound traffic; if the proxy selects the server with the shortest response time during load balancing, tail latency is significantly reduced.
Performance Issues Are Local
Node performance is influenced by many dynamic factors, making precise prediction difficult. In cloud environments, server performance often follows a normal distribution: most nodes perform normally, while a small minority (outliers beyond roughly 3σ) perform poorly. Additionally, infrastructure‑induced dynamic latency (e.g., network congestion, faults, traffic spikes) tends to be persistent and localized.
PeakEWMA
To address these challenges, MOSN uses PeakEWMA (Peak Exponentially Weighted Moving Average) to compute a response‑time metric for load balancing. EWMA applies exponentially decaying weights, giving recent measurements higher influence while still considering older data. The decay rate is controlled by a constant α (0 ≤ α ≤ 1). This method requires few samples and no fixed time window, making it suitable for sidecar proxies.
Because response time alone is a historical metric, MOSN also incorporates the number of active connections as a real‑time factor, weighting response time by the maximum wait time for all active connections.
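One way to combine the two signals, sketched here with hypothetical names and a formula assumed from the description above (not taken from MOSN's source), is to weight the Peak EWMA latency by the number of in‑flight requests:

```go
package main

import "fmt"

// loadScore is a sketch of combining a historical metric (Peak EWMA
// latency) with a real-time one (active, in-flight requests). The +1
// keeps idle hosts comparable instead of scoring zero. The exact
// formula is an assumption for illustration.
func loadScore(ewmaMs float64, activeRequests int) float64 {
	return ewmaMs * float64(activeRequests+1)
}

func main() {
	// A fast host with a deep queue can score worse than a slower idle one.
	fmt.Println(loadScore(20, 9)) // 200
	fmt.Println(loadScore(60, 0)) // 60
}
```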
P2C (Power of Two Choices)
Scanning all servers to find the lightest load is impractical at scale. P2C selects two random servers, compares their loads, and forwards the request to the lighter one, achieving near‑optimal balancing in constant time. It bases decisions on the probability that one server outperforms another rather than static weights. When multiple load balancers have differing views of node health, P2C still yields stable, evenly distributed traffic.
In MOSN v1.5.0, P2C is used when node weights are equal; otherwise, EDF (Earliest Deadline First) performs weighted selection. Future releases will expose a configurable option.
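For intuition, the EDF idea can be sketched as follows: each host carries a virtual deadline, the host with the earliest deadline is chosen, and its deadline then advances by 1/weight, so a host with twice the weight is picked twice as often. This is a linear-scan illustration of the concept (a real implementation would typically use a priority queue), with names assumed for the example:

```go
package main

import "fmt"

// edf is a sketch of Earliest Deadline First weighted selection.
type edf struct {
	deadlines []float64 // virtual deadline per host
	weights   []float64 // relative weight per host
}

// pick returns the host with the earliest deadline, then advances
// that host's deadline by 1/weight: higher weight = smaller step =
// more frequent selection.
func (e *edf) pick() int {
	best := 0
	for i, d := range e.deadlines {
		if d < e.deadlines[best] {
			best = i
		}
	}
	e.deadlines[best] += 1 / e.weights[best]
	return best
}

func main() {
	e := &edf{deadlines: make([]float64, 2), weights: []float64{2, 1}}
	counts := make([]int, 2)
	for i := 0; i < 300; i++ {
		counts[e.pick()]++
	}
	fmt.Println(counts) // [200 100]: traffic split 2:1 by weight
}
```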
Simulation Traffic Validation
We built a test case that mirrors production performance distribution to validate the algorithm.
First, we generated baseline performance for 10 servers using a normal distribution (mean = 50 ms, σ = 10 ms). Request latency was then generated with a normal distribution (σ = 5 ms). One server was injected with a 10% fault probability that adds a 1000 ms delay to test fault tolerance.
To emulate queueing delay under request skew, each server’s maximum concurrency was limited to 8; excess requests wait in a queue, reflecting realistic system behavior.
We compared Round Robin, Least Request, and PeakEWMA under a load of 16 concurrent requests and recorded the P99 latency.
Round Robin balances load but repeatedly selects the faulty server, causing P99 to hover around 1000 ms. Least Request avoids the faulty server but still shows large P99 fluctuations. In contrast, PeakEWMA maintains stability and consistently achieves lower P99 values than the other two algorithms, demonstrating MOSN’s performance gains.
Looking for More Stability
While service meshes accelerate request handling, failures remain inevitable in distributed systems. We aim for MOSN’s load‑balancing to make services not only faster but also more stable.
Fast‑Failure Challenge
Failures often produce response times far shorter than normal (e.g., an immediately refused connection). From a load‑balancing perspective, such fast failures can mistakenly be perceived as optimal servers. Circuit breakers can prevent continued requests to failing nodes, but setting appropriate thresholds is challenging and requires sufficient error samples.
Future versions will adjust the load‑balancing algorithm to detect errors early and avoid routing requests to faulty servers before circuit breakers trigger.
JD Cloud Developers
JD Cloud Developers is JD Technology Group's platform for technical sharing and communication among developers in AI, cloud computing, IoT, and related fields. It publishes JD product technical information, industry content, and tech‑event news. Embrace technology and partner with developers to envision the future.