Improving Cache Invalidation and Consistency at Scale
Meta engineers describe the challenges of cache invalidation and consistency in large‑scale distributed systems, explain why stale caches are problematic, present their Polaris observability service and consistency‑tracking techniques, and detail how they raised TAO’s cache consistency from six‑nines to ten‑nines.
Caching reduces latency, helps read‑heavy workloads scale, and cuts costs, but cache invalidation and consistency are notoriously hard problems. As Phil Karlton famously quipped, there are only two hard things in computer science: cache invalidation and naming things. Meta operates some of the world’s largest caches (TAO and Memcache) and has improved TAO’s consistency from 99.9999% (six nines) to 99.99999999% (ten nines).
Definition of Cache Invalidation and Consistency
Cache invalidation is the process of actively expiring stale cache entries when the underlying data source changes. If handled incorrectly, stale values can persist indefinitely, leading to inconsistency between cache and source.
Invalidation requires an external program (client or subsystem) to notify the cache of data changes; simple TTL‑based expiration is out of scope.
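To make the distinction concrete, here is a minimal sketch of an invalidation‑driven cache: reads fill from the backing store on a miss, and a writer (or a subscriber to the store's change feed) must explicitly evict the stale entry. The class and names are illustrative, not Meta's API.

```python
class InvalidatedCache:
    """Toy read-through cache that relies on explicit invalidation, not TTLs."""

    def __init__(self, db):
        self.db = db          # authoritative store (dict-like here)
        self.entries = {}     # key -> cached value

    def get(self, key):
        # Cache fill on miss: read through to the database.
        if key not in self.entries:
            self.entries[key] = self.db[key]
        return self.entries[key]

    def invalidate(self, key):
        # Called by an external writer or subscriber when the DB changes.
        self.entries.pop(key, None)


db = {"user:1": "alice"}
cache = InvalidatedCache(db)
assert cache.get("user:1") == "alice"

db["user:1"] = "bob"          # a write to the source of truth...
assert cache.get("user:1") == "alice"  # ...leaves the cache stale...
cache.invalidate("user:1")    # ...until an invalidation arrives
assert cache.get("user:1") == "bob"
```

The point of the sketch is the coupling: every write path must be paired with an invalidation, which is exactly the part that is easy to get wrong at scale.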
Why Consistency Matters
In some cases, cache inconsistency is as severe as losing data from the database. Meta’s TAO service once suffered a split‑brain‑style inconsistency in which different replicas held divergent copies of a user’s messages after a cross‑region migration, illustrating the user‑visible impact of stale caches.
Cache Invalidation Model
Static caches (e.g., simplified CDNs) have immutable data and no active invalidation. Dynamic caches like TAO and Memcache experience reads (cache fills) and writes (invalidations) on the same path, creating race conditions and making it difficult to track every state change.
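The classic fill/invalidation race can be shown with an explicit interleaving. This is a deliberately simplified simulation, not TAO's code: a cache fill reads an old value, the write and its invalidation land in between, and the delayed fill then installs the stale value after the invalidation has already run.

```python
# Simulate the fill/invalidation race with an explicit interleaving.
db = {"k": 0}
cache = {}

# Step 1: a reader misses, reads the old value from the DB,
# but has not yet written it into the cache.
in_flight_fill = db["k"]          # reads 0

# Step 2: a writer updates the DB and sends an invalidation.
db["k"] = 1
cache.pop("k", None)              # the invalidation finds nothing to evict

# Step 3: the delayed fill now lands, installing the stale value.
cache["k"] = in_flight_fill

# With no further writes to "k", the cache stays stale indefinitely.
assert cache["k"] == 0 and db["k"] == 1
```

Because fills and invalidations share the same path, every such interleaving is a potential bug, and with trillions of fills per day even vanishingly rare interleavings occur in practice.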
TAO serves more than a quadrillion queries daily; even with a hit rate above 99%, that still leaves over 10 trillion cache fills per day, making exhaustive logging impractical.
Observability for Consistency
To address these challenges, Meta built Polaris, a service that measures cache consistency and alerts on violations without false positives. Polaris treats anything a client could observe as anomalous as a real inconsistency, and focuses on invariants such as “the cache should eventually match the database.”
After receiving an invalidation event, Polaris queries all cache replicas to verify the invariant holds, and reports violations at multiple time scales (e.g., one, five, and ten minutes). It defers the expensive check against the database until a violation has persisted across a time‑scale boundary, keeping load off the primary store.
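A single Polaris‑style check could look roughly like the sketch below. The function name, signature, and time scales are our own assumptions for illustration, not Meta's API; the idea it captures is that the authoritative read is paid for only after a mismatch has outlived the smallest reporting window.

```python
SCALES = (60, 300, 600)  # reporting windows in seconds: 1, 5, 10 minutes

def polaris_verdict(expected, replica_reads, db_read, elapsed):
    """One Polaris-style check (illustrative naming, not Meta's API).

    expected      -- value the invalidation event says caches should converge to
    replica_reads -- callables returning each replica's current value
    db_read       -- expensive authoritative read, invoked only when needed
    elapsed       -- seconds since the invalidation event was observed
    """
    if all(read() == expected for read in replica_reads):
        return "consistent"
    if elapsed < SCALES[0]:
        return "pending"  # the fill may still be in flight; recheck later
    # The mismatch persisted across a time-scale boundary: only now pay for
    # an authoritative read, since a newer write may have legitimately
    # replaced the value the invalidation event carried.
    truth = db_read()
    if all(read() == truth for read in replica_reads):
        return "consistent"
    return "violation"
```

Note the middle case: a replica that disagrees with the event's value but agrees with the database is not a violation, because a later write simply superseded the event.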
Polaris exposes metrics of the form “N nines of cache writes are consistent within M minutes.” By this measure, 99.99999999% of TAO writes are consistent within five minutes, i.e., fewer than one inconsistency per ten billion writes.
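The “nines” framing is just a logarithm of the violation rate; a quick helper (our own, for illustration) makes the arithmetic behind “ten nines” explicit:

```python
import math

def nines(consistent_writes, total_writes):
    """Number of nines of consistency: 99.9% -> 3, 99.99999999% -> 10."""
    violations = total_writes - consistent_writes
    if violations == 0:
        return float("inf")
    return -math.log10(violations / total_writes)

# Ten nines means at most one inconsistent write per ten billion.
assert round(nines(10**10 - 1, 10**10)) == 10
```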
Consistency Tracing
Complementing Polaris, consistency tracing records cache mutations only during windows in which an inconsistency is actually possible: the “purple window” where competing writes and invalidations race. The tracing library is embedded in Meta’s major cache services, buffers recent modifications, and supports tracing the code paths taken.
The approach has uncovered many defects and offers a scalable way to diagnose cache inconsistency.
Real‑World Bug Discovered This Year
A rare bug left a cache entry permanently inconsistent: the cache stored metadata=0 @version4 while the database held metadata=1 @version4. The root cause was a flawed error‑handling path that called drop_cache(key, version) but deleted the entry only if its cached version was strictly lower than the specified version, so the entry at the same version with the wrong value was never evicted.
Polaris detected the anomaly and alerted engineers, who pinpointed and fixed the bug in under 30 minutes.
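The flawed comparison can be reconstructed as a short sketch. This is our reading of the described logic, not Meta's actual code: the buggy path uses a strict `<` comparison, so an entry at the same version, holding wrong data from a failed fill, survives forever, while the fix also drops the entry at the specified version.

```python
def drop_cache_buggy(cache, key, version):
    """Flawed error path: drops the entry only if the cached version is
    strictly older, so an at-version entry with wrong data survives."""
    entry = cache.get(key)
    if entry is not None and entry["version"] < version:
        del cache[key]

def drop_cache_fixed(cache, key, version):
    """Fix sketch: an error at `version` means the entry at that version
    (or older) cannot be trusted, so drop it at-or-below that version."""
    entry = cache.get(key)
    if entry is not None and entry["version"] <= version:
        del cache[key]


cache = {"k": {"version": 4, "metadata": 0}}  # wrong data at version 4
drop_cache_buggy(cache, "k", 4)
assert "k" in cache                           # stale entry is never evicted

drop_cache_fixed(cache, "k", 4)
assert "k" not in cache                       # corrected path evicts it
```

An off‑by‑one in a version comparison on a rarely taken error path is exactly the kind of defect that only observability like Polaris surfaces in practice.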
Future Work on Cache Consistency
Meta plans to push cache consistency even closer to 100%, address challenges of distributed secondary indexes, improve read‑time consistency measurements, and develop high‑level consistency APIs for distributed systems (e.g., C++ std::memory_order semantics).
Original article: https://engineering.fb.com/2022/06/08/core-data/cache-invalidation/
High Availability Architecture
Official account for High Availability Architecture.