Cache Avalanche Incident: Root Cause, Response, and Prevention Strategies
A flash-sale outage was traced to a cache avalanche: every pre-warmed item was given the same two-hour expiration, so the whole cache expired at once and the full read load hit the database. This post covers how the failure was detected, the emergency mitigation, and three established techniques for preventing a recurrence: staggered expiration, mutex locking, and never-expire caches.
The company launched a flash‑sale event that was supposed to start at midnight, but a backend mistake led to severe performance issues and complaints from users and agents.
At 22:00 the operations team published the products, and at 23:00 the backend engineer pre‑warmed the cache. The plan relied on Redis to handle most read requests, keeping the database safe.
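However, the engineer set the same two-hour expiration on every cached item. Because the cache was pre-warmed at 23:00, every entry expired together at about 01:00, an hour into the sale. With the cache empty, all read traffic fell through to the database, which crashed and returned timeout errors: a classic cache avalanche.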
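The timing can be checked with a few lines of Python. This is only an illustrative sketch: the key names and the key count are hypothetical, and the calendar date is arbitrary because only the time of day matters.

```python
from collections import Counter
from datetime import datetime, timedelta

PREWARM_AT = datetime(2000, 1, 1, 23, 0)  # arbitrary date; pre-warm at 23:00
TTL = timedelta(hours=2)                  # the uniform TTL from the incident


def expiry_times(keys, prewarm_at=PREWARM_AT, ttl=TTL):
    """Return the expiry moment of each key when every key gets the same TTL."""
    return {key: prewarm_at + ttl for key in keys}


keys = [f"product:{i}" for i in range(1000)]  # hypothetical key names
by_moment = Counter(expiry_times(keys).values())

for moment, count in by_moment.items():
    print(moment.strftime("%H:%M"), count)
```

All 1000 keys land on a single expiry moment, which is why the database received the full read load at once instead of a trickle of misses.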
The incident was first noticed at 01:02 when SRE alerts showed CPU and memory spikes on the database nodes, prompting immediate investigation.
The root cause was the uniform cache TTL. Nothing flagged the problem earlier because, until the entries expired, the cache was still serving most requests and the system looked healthy.
Emergency actions taken:
1. Restricted incoming traffic via API Gateway.
2. Restarted the failed database service.
3. Re-warmed the cache.
4. After confirming normal operation, gradually released the traffic, restoring the flash sale by around 01:30.
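Steps 1 and 4 amount to throttling requests and then raising the limit in stages. The post does not say how the API Gateway was configured, so the following is only a generic token-bucket sketch of that idea. The class name `TokenBucket` and its parameters are hypothetical and do not correspond to any particular gateway product.

```python
import time


class TokenBucket:
    """Admit roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = capacity
        self._last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, capped at capacity.
        now = self._clock()
        earned = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def set_rate(self, rate):
        """Change the admitted rate, e.g. to release traffic gradually."""
        self._refill()               # settle tokens at the old rate first
        self.rate = rate

    def allow(self):
        """Return True if the request may pass, False if it should be rejected."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

During recovery an operator would start with a low `rate` while the cache is re-warmed, then call `set_rate` with progressively higher values once the database metrics stay normal.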
To avoid future cache avalanches, three mature solutions were discussed:
Staggered expiration: Assign different TTLs, or add random jitter to a base TTL, so that cache items expire at different times instead of all at once.
Mutex lock: When an entry is missing, let only one thread rebuild it while the others wait, preventing a thundering herd on the database.
Never-expire cache: Set no TTL in the cache itself, so entries never physically expire, and refresh them asynchronously in the background.
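The three techniques can be sketched together in Python. This is a minimal illustration under stated assumptions, not the team's implementation: an in-process dictionary stands in for Redis, `loader` is a hypothetical function that reads from the database, the ten-minute jitter window is an arbitrary choice, and the never-expire variant assumes a logical expiry time stored next to each value.

```python
import random
import threading
import time

BASE_TTL = 2 * 60 * 60  # two hours, as in the incident
JITTER = 10 * 60        # assumed spread of up to ten minutes


def jittered_ttl(base=BASE_TTL, jitter=JITTER):
    """Staggered expiration: base TTL plus a random offset in seconds."""
    return base + random.randint(0, jitter)


class Cache:
    """In-process stand-in for Redis; each value carries its own expiry time."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical database read function
        self._data = {}                  # key -> (value, expires_at)
        self._locks = {}                 # key -> lock guarding rebuilds
        self._guard = threading.Lock()   # protects the _locks dictionary

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.time():
            return entry
        return None

    def get(self, key):
        """Mutex lock: on a miss, only one thread calls the loader."""
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            entry = self._fresh(key)     # re-check: another thread may have rebuilt it
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            self._data[key] = (value, time.time() + jittered_ttl())
            return value

    def get_never_expire(self, key):
        """Never-expire: serve the stored value and refresh it in the background."""
        entry = self._data.get(key)
        if entry is None:
            return self.get(key)         # first load has nothing stale to serve
        value, expires_at = entry
        if expires_at <= time.time():
            lock = self._lock_for(key)
            if lock.acquire(blocking=False):   # at most one refresh per key

                def refresh():
                    try:
                        fresh = self._loader(key)
                        self._data[key] = (fresh, time.time() + jittered_ttl())
                    finally:
                        lock.release()

                threading.Thread(target=refresh, daemon=True).start()
        return value
```

With `get`, concurrent misses on the same key queue behind one database read. With `get_never_expire`, callers never wait on the database after the first load, at the cost of occasionally reading a slightly stale value.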
The team concluded the post‑mortem with a reinforced understanding of cache avalanche risks and a commitment to respect every line of code.
IT Services Circle