Uncovering Hidden Cache Failures: A Robustness Test of Memcached with spymemcached
This article documents a robustness test of the caching layer of a core public service system. It shows how taking Memcached (MC) instances offline sharply degrades TPS and latency, analyzes the underlying Ketama consistent-hash algorithm in the spymemcached client, and proposes concrete improvements to mitigate such failures.
Robustness Test
The system, a core public service platform, normally handles 330,000 QPS with sub-millisecond latency, but that figure assumes every component is healthy. This article explores what happens when components fail, presenting a full robustness test that uncovered an algorithmic issue in the middleware client.
Interface Cache Mechanism
The interface layer uses a distributed cache (Memcached) and a local cache. The cache lookup flow is:
When a request arrives, look up in Memcached (MC); if hit, return the value.
If MC miss, look up the local cache; if hit, return the value and asynchronously write it back to MC.
If both caches miss, fetch from the DB, return the result, and asynchronously write to the local cache then to MC.
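The lookup flow above can be sketched as a small two-tier cache. This is a minimal model using plain HashMaps in place of Memcached, the local cache, and the DB; all class and key names here are hypothetical, and the asynchronous write-backs are done synchronously for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of the two-tier lookup flow; not the production code. */
public class TwoTierCache {
    private final Map<String, String> mc = new HashMap<>();      // stands in for Memcached
    private final Map<String, String> local = new HashMap<>();   // stands in for the local cache
    private final Map<String, String> db = new HashMap<>();      // stands in for the DB

    public TwoTierCache() {
        db.put("sku-1", "price=9.99");   // sample backing data
    }

    public String get(String key) {
        String v = mc.get(key);
        if (v != null) return v;                 // 1. MC hit: return immediately
        v = local.get(key);
        if (v != null) {
            mc.put(key, v);                      // 2. local hit: write back to MC
            return v;
        }
        v = db.get(key);                         // 3. both miss: fetch from DB
        if (v != null) {
            local.put(key, v);                   //    then populate local cache and MC
            mc.put(key, v);
        }
        return v;
    }

    public static void main(String[] args) {
        TwoTierCache cache = new TwoTierCache();
        System.out.println(cache.get("sku-1")); // first call falls through to the DB
        System.out.println(cache.get("sku-1")); // second call is an MC hit
    }
}
```

The second `get` for the same key is served from the MC map, mirroring how the real system keeps the DB QPS low.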
Test Plan
100 SKUs are mapped to keys in MC, with the 100 keys evenly distributed across 15 MC instances. Each request covers 10 or 100 SKUs, and load is generated by 10 machines running 10 threads each. MC instances are shut down one by one until only one remains, then restarted, to observe how the service degrades and recovers.
Test Results
Scenario 1: 10 SKUs per request
Scenario 2: 100 SKUs per request
Observed Degradation When MC Instances Drop Below 10
When the number of live MC instances falls from 15 to 10, results match expectations. With only 9 instances alive, however, TPS drops sharply and response time rises; under the 100-SKU load, TPS falls to less than one-tenth of the original and latency grows more than 20-fold.
Problem Discovery Process
According to the original design, when an MC instance goes down, its key‑value pairs should be rehashed to surviving nodes, and subsequent requests should retrieve the data from those nodes. The test confirmed that rehashing occurs, but the interface service sometimes reports the key as missing, contradicting the design.
Memcached
When TPS declines sharply, the hit rate of the remaining MC instances is examined; it fluctuates widely, inconsistent with the expected behavior.
MySQL
Database QPS remains around 50 throughout the test, indicating that most interface responses are served from the cache rather than the DB.
Mechanism and Algorithm
The system communicates with the memcached cluster via the Java client spymemcached, which relies on Java NIO for asynchronous, single-threaded operation.
spymemcached Overview
Features
Efficient storage
Strong tolerance to server and network interruptions
Full asynchronous API
Single‑threaded simplicity and efficiency
Optimized for high‑throughput processing
Core Components
MemcachedClient: API façade exposed to users.
MemcachedConnection: Manages connections to MC instances.
NodeLocator: Finds the MC node for a given key; supports algorithms such as ArrayModNodeLocator and KetamaNodeLocator. The system uses the Ketama consistent-hash implementation.
MemcachedNode: Handles the actual NIO-based communication with a single MC instance.
Operation: Abstract representation of any MC operation passed between MemcachedConnection and MemcachedNode.
Ketama Algorithm
Each MC instance is represented by 128 virtual nodes on the hash ring. When a key is hashed, it maps to a virtual node; if the corresponding MC instance is alive, the value is returned. If not, the next virtual node is tried, up to seven attempts before the request fails.
public final class KetamaNodeLocator extends SpyObject implements NodeLocator {
...
public Iterator<MemcachedNode> getSequence(String k){
// Seven searches gives us a 1 in 2^7 chance of hitting the same dead node all of the time.
return new KetamaIterator(k, 7, getKetamaNodes(), hashAlg);
}
...
}

If no live MC instance is found after seven attempts, the request waits out the 400 ms operation timeout before returning a failure.
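The ring lookup with a bounded retry can be sketched with a sorted map. This is a simplified model, assuming MD5-based positions and retries on consecutive virtual nodes; it is not spymemcached's exact KetamaIterator, which rehashes the key on each attempt.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

/** Simplified Ketama-style ring with a seven-attempt search limit. */
public class KetamaSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final Set<String> deadNodes = new HashSet<>();
    private static final int VNODES = 128;   // virtual nodes per MC instance
    private static final int MAX_TRIES = 7;  // matches the getSequence limit above

    public KetamaSketch(List<String> nodes) {
        for (String node : nodes)
            for (int i = 0; i < VNODES; i++)
                ring.put(hash(node + "-" + i), node);
    }

    public void markDead(String node) { deadNodes.add(node); }

    /** Walk the ring clockwise from the key's position, skipping dead nodes. */
    public String locate(String key) {
        long h = hash(key);
        for (int attempt = 0; attempt < MAX_TRIES; attempt++) {
            Map.Entry<Long, String> e = ring.ceilingEntry(h);
            if (e == null) e = ring.firstEntry();   // wrap around the ring
            if (!deadNodes.contains(e.getValue())) return e.getValue();
            h = e.getKey() + 1;                     // advance to the next virtual node
        }
        return null;                                // every attempt hit a dead node
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                 | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KetamaSketch ring = new KetamaSketch(Arrays.asList("mc1", "mc2", "mc3"));
        System.out.println("located: " + ring.locate("sku-42"));
    }
}
```

With all nodes alive a lookup always resolves; once enough instances are dead, a run of dead virtual nodes can exhaust all seven attempts, which is exactly the failure mode the test exposed.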
Impact of Ketama on Test Scenario
In this scenario each SKU maps to a distinct MC instance; with 15 × 128 virtual nodes on the ring, a higher proportion of downed MC instances dramatically increases the chance of missing a live node, especially when many keys are requested simultaneously.
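A rough model makes this concrete. Assuming each of the seven attempts lands on an instance chosen uniformly at random (a simplification of the real virtual-node geometry), a single key's lookup fails with probability (dead/total)^7, and a multi-SKU request fails if any of its keys does:

```java
/** Back-of-the-envelope failure-probability model; a simplification, not exact. */
public class FailureOdds {
    /** Chance that all `tries` ring attempts for one key hit a dead instance. */
    public static double keyFailureProb(int dead, int total, int tries) {
        return Math.pow((double) dead / total, tries);
    }

    /** Chance that at least one key in a multi-SKU request fails. */
    public static double requestFailureProb(int dead, int total, int tries, int keysPerRequest) {
        double p = keyFailureProb(dead, total, tries);
        return 1 - Math.pow(1 - p, keysPerRequest);
    }

    public static void main(String[] args) {
        for (int dead : new int[]{5, 6, 10, 14}) {
            System.out.printf("dead=%d/15: per-key=%.4f%%, per-100-SKU-request=%.2f%%%n",
                    dead,
                    keyFailureProb(dead, 15, 7) * 100,
                    requestFailureProb(dead, 15, 7, 100) * 100);
        }
    }
}
```

Under this model the per-key failure odds stay tiny until many instances are down, but a 100-key request amplifies them dramatically, which matches the observed cliff in TPS once enough instances were killed.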
Improvement Suggestions
Paginate requests so that each batch contains no more than 10 SKUs; with one‑third of MC instances still alive, TPS only drops to about half of the original, which is acceptable.
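The batching suggestion can be sketched with a small generic helper (names hypothetical) that splits the SKU list into pages of at most 10 before each cache multi-get:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper for the pagination suggestion above. */
public class SkuPaginator {
    /** Splits `items` into consecutive pages of at most `pageSize` elements. */
    public static <T> List<List<T>> paginate(List<T> items, int pageSize) {
        List<List<T>> pages = new ArrayList<>();
        for (int i = 0; i < items.size(); i += pageSize)
            // subList returns a view; copy it if the source list may be mutated
            pages.add(items.subList(i, Math.min(i + pageSize, items.size())));
        return pages;
    }
}
```

Issuing each page as its own request bounds how many keys a single slow or failed lookup can drag down with it.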
Introduce a custom NodeLocator subclass that exposes the retry count, e.g.:
public Iterator<MemcachedNode> getSequence(String k) {
return new KetamaIterator(k, retryTimes, getKetamaNodes(), hashAlg);
}

Additionally, expose the default 400 ms operation timeout so it can be tuned when needed.
Takeaways
The root cause is now clear, and the issue likely exists in any system using spymemcached. Immediate audits of cache logic, along with refinement of business and system logic, are recommended to prevent similar failures.
Final Note
All middleware clusters (e.g., MC, Redis, Zookeeper) can be evaluated with similar single‑instance/multi‑instance fault simulations to assess failover behavior and performance during degradation.
Vipshop Quality Engineering
Technology exchange and sharing for quality engineering