PacificA: Microsoft’s General Replication Framework for Large‑Scale Distributed Storage Systems
PacificA is Microsoft’s generic replication framework for large‑scale distributed storage systems. It provides strong consistency, separates configuration management from data replication, and uses a primary‑secondary model with lease‑based fault detection to remain available and correct while operating efficiently across heterogeneous nodes.
As information volumes surge, large‑scale distributed storage systems become essential, but they rely on inexpensive commodity hardware where failures are common, making fault tolerance critical for availability and reliability. While many correct replication protocols exist, practical systems must balance theoretical guarantees with overall performance.
Microsoft addressed this by creating PacificA, a simple, practical, and universally applicable replication framework that supports strong consistency and can host various replication strategies, aiding both understanding and comparison of different approaches.
Design Highlights
Provides a generic, abstract replication framework that is easy to verify and instantiate with different strategies.
Separates configuration management (handled by Paxos) from data replication (handled by primary‑secondary policies).
Integrates error detection and configuration updates into the replication interaction, decentralizing control and reducing bottlenecks.
System Architecture
The system consists of two main clusters: a storage cluster that stores data with multiple replicas for reliability, and a configuration‑management cluster that maintains replica metadata such as participating nodes, primary node, and version information.
Data is stored in fixed‑size shards, each represented as a large file on a storage node; a shard’s replicas form a Replication Group, and each node may belong to many groups.
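The metadata the configuration‑management cluster keeps per group can be sketched with a small illustrative type (the names `ReplicationGroup`, `shardId`, and so on are hypothetical, not PacificA's actual identifiers):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of per-group metadata: the shard being replicated,
// the current primary, the secondaries, and a topology version number.
class ReplicationGroup {
    final String shardId;          // the fixed-size shard this group replicates
    String primary;                // current primary node
    final List<String> secondaries = new ArrayList<>();
    long version;                  // bumped on every topology change

    ReplicationGroup(String shardId, String primary, List<String> secondaries) {
        this.shardId = shardId;
        this.primary = primary;
        this.secondaries.addAll(secondaries);
        this.version = 1;
    }

    int replicaCount() { return 1 + secondaries.size(); }
}
```

Since a storage node may belong to many groups, the configuration service effectively maintains a map from shard to one such record.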
Data Replication Flow
Updates enter the primary, receive a serial number (Sn), and are placed in a prepare list ordered by Sn.
The primary forwards the record to secondaries, which also insert it into their prepare lists.
Once the primary receives acknowledgments from all secondaries, it moves the record to the commit list and applies it to its state machine.
The primary then replies to the client and notifies secondaries to commit the record.
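The primary's side of these four steps can be sketched as follows (a single‑threaded, in‑memory sketch; real PacificA persists these lists and forwards records over the network, and all names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the primary's prepare/commit bookkeeping for one replication group.
class Primary {
    private long nextSn = 0;                       // serial numbers assigned in arrival order
    private final Deque<Long> prepareList = new ArrayDeque<>();
    private final Deque<Long> commitList = new ArrayDeque<>();
    private final int secondaryCount;

    Primary(int secondaryCount) { this.secondaryCount = secondaryCount; }

    // Steps 1-2: assign an Sn and place the record on the prepare list
    // (the record would now be forwarded to all secondaries).
    long prepare() {
        long sn = nextSn++;
        prepareList.addLast(sn);
        return sn;
    }

    // Steps 3-4: once acks from ALL secondaries arrive, move the record
    // from the prepare list to the commit list and apply it; the caller
    // would then reply to the client and notify secondaries to commit.
    boolean onAcks(long sn, int acks) {
        if (acks < secondaryCount) return false;   // wait for every secondary
        Long head = prepareList.peekFirst();
        if (head != null && head == sn) {
            commitList.addLast(prepareList.pollFirst());
            return true;
        }
        return false;
    }

    int committedCount() { return commitList.size(); }
}
```

Because commit requires acknowledgments from every secondary, a committed record is guaranteed to be present on all replicas, which is what makes the later reconciliation step safe.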
Configuration Management Service
It stores the topology of each replication group (replica addresses and version). Changes such as a secondary going offline, a primary failing, or a new node joining trigger topology updates that are reported to the service.
When a node reports a new topology, it increments the version; the service accepts the update only if the reported version is newer than its current one.
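The acceptance rule is just a monotonic version comparison. A minimal sketch (the real service replicates this decision through Paxos for fault tolerance; names are illustrative):

```java
// Sketch of the configuration service's acceptance rule: a reported
// topology wins only if its version is strictly newer than the stored one.
class ConfigService {
    private long currentVersion = 0;

    boolean tryUpdate(long reportedVersion) {
        if (reportedVersion <= currentVersion) {
            return false;          // stale report, e.g. from a deposed primary
        }
        currentVersion = reportedVersion;
        return true;
    }

    long version() { return currentVersion; }
}
```

This is why a deposed primary cannot reassert itself: any topology it reports carries an old version and is rejected.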
Node Failure Detection
PacificA uses a lease mechanism: the primary periodically sends heartbeat beacons to its secondaries, and each secondary acknowledges them. If the primary receives no acknowledgment from a secondary within the lease period, its lease expires: it stops serving as primary and asks the configuration service to remove that secondary from the group. Conversely, if a secondary receives no heartbeat from the primary within the grace period, it reports the primary as failed and asks to be promoted in its place.
Ensuring lease period ≤ grace period prevents split‑brain: the old primary's lease always expires, and it stops serving, before any secondary can act, even under network partitions.
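The two timers can be sketched directly (millisecond clocks; all names are illustrative):

```java
// Sketch of lease-based failure detection. The invariant
// leasePeriodMs <= gracePeriodMs guarantees the primary demotes
// itself before any secondary can declare it dead.
class LeaseClock {
    final long leasePeriodMs;   // primary's view: how long a secondary's ack stays valid
    final long gracePeriodMs;   // secondary's view: how long to wait for a heartbeat

    LeaseClock(long leasePeriodMs, long gracePeriodMs) {
        if (leasePeriodMs > gracePeriodMs) {
            throw new IllegalArgumentException("lease must not exceed grace period");
        }
        this.leasePeriodMs = leasePeriodMs;
        this.gracePeriodMs = gracePeriodMs;
    }

    // The primary stops serving once a secondary's last ack is too old.
    boolean primaryMustStepDown(long nowMs, long lastAckMs) {
        return nowMs - lastAckMs > leasePeriodMs;
    }

    // A secondary may report the primary as failed only after the grace period.
    boolean secondaryMayPromote(long nowMs, long lastHeartbeatMs) {
        return nowMs - lastHeartbeatMs > gracePeriodMs;
    }
}
```

With the lease the shorter of the two, `primaryMustStepDown` always fires before `secondaryMayPromote` for the same silent link, so there is never a moment with two active primaries.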
Failure Handling
If a secondary fails, the primary reports a configuration change that removes it from the group; a replacement secondary can be added back later.
If the primary fails, a secondary reports the failure, removes the primary from the group, and becomes the new primary after the configuration service approves.
After topology changes, a reconciliation phase synchronizes prepare and commit lists between the new primary and its secondaries, guaranteeing that committed records are never rolled back.
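Reconciliation can be sketched as the new primary committing everything on its own prepare list, then having each secondary truncate to match (an illustrative sketch; the real protocol replays records over the network):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of reconciliation after a new primary takes over. Because commit
// requires acks from ALL secondaries, any record ever committed is already
// on every replica's prepare list, so extending the new primary's commit
// point to the end of its prepare list can never roll back a committed record.
class Reconciliation {
    // The new primary commits everything it has prepared.
    static List<Long> commitPrepared(List<Long> newPrimaryPrepareList) {
        return new ArrayList<>(newPrimaryPrepareList);
    }

    // Secondaries drop any prepared records beyond the new primary's last Sn.
    static List<Long> truncateToMatch(List<Long> secondaryPrepareList, long maxSn) {
        List<Long> out = new ArrayList<>();
        for (long sn : secondaryPrepareList) {
            if (sn <= maxSn) out.add(sn);
        }
        return out;
    }
}
```

Uncommitted records beyond the new primary's prepare list may be discarded, but that is safe: they were never acknowledged to any client.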
Log‑Based Synchronization
PacificA adopts a log‑structured approach where updates are written to a sequential log (LOG) with serial numbers, applied to an in‑memory state machine, periodically checkpointed to disk, and merged into on‑disk images. Variants include synchronizing only the log to secondaries or also transmitting checkpoints and on‑disk images, each with trade‑offs in network load and recovery time.
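The write path above can be sketched as: append to the sequential log, apply in memory, and periodically checkpoint into the on‑disk image (an in‑memory sketch with illustrative names; the variants differ only in which of these three pieces gets shipped to secondaries):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the log-structured update path: every write first goes to the
// sequential LOG with its serial number, then to the in-memory state;
// checkpoint() merges memory into a (simulated) on-disk image.
class LogStructuredStore {
    private final List<String> log = new ArrayList<>();          // sequential LOG
    private final TreeMap<String, String> memTable = new TreeMap<>();
    private final TreeMap<String, String> diskImage = new TreeMap<>();
    private long nextSn = 0;

    void put(String key, String value) {
        log.add(nextSn++ + ":" + key + "=" + value);  // 1. append to the log
        memTable.put(key, value);                      // 2. apply in memory
    }

    void checkpoint() {
        diskImage.putAll(memTable);                    // 3. merge into the on-disk image
        memTable.clear();
        log.clear();                                   // entries are now durable in the image
    }

    String get(String key) {
        String v = memTable.get(key);                  // memory shadows the disk image
        return v != null ? v : diskImage.get(key);
    }

    int logSize() { return log.size(); }
}
```

Shipping only the log minimizes network traffic but forces secondaries to replay it on recovery; also shipping checkpoints and images trades bandwidth for faster failover.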
Layered Design
Some systems delegate reliability to the underlying distributed file system (e.g., BigTable on GFS, HBase on HDFS), a technique PacificA can also employ.
Reference
PacificA: Replication in Log‑Based Distributed Storage Systems (source: http://www.d-kai.me).
Architecture Digest
Focused on Java backend development, covering application architecture at top‑tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular topics.