
How OceanBase Guarantees Data Reliability and Service High‑Availability

The article explains how OceanBase, a distributed enterprise‑grade database, achieves strong data reliability and rapid service recovery on ordinary PC servers by combining Paxos‑based consensus, enhanced redo‑log verification, periodic checkpoint checks, and fine‑grained fail‑over mechanisms, surpassing traditional hardware‑dependent databases.

AntTech

Traditional commercial databases such as Oracle and DB2 rely heavily on high‑end hardware to provide data reliability (RPO = 0) and service availability. They use techniques such as redo logs, primary‑standby hot backup, backup/restore, and storage‑layer checks, yet without expensive specialized hardware they still cannot guarantee zero data loss.

With the shift to inexpensive PC servers, hardware reliability drops, making it difficult for distributed databases to meet the same guarantees. OceanBase addresses this by moving most protection mechanisms to the software layer, especially by integrating the Paxos consensus protocol with the traditional write‑ahead‑log (WAL) system.

In OceanBase, every redo‑log entry is synchronously persisted on a majority of Paxos replicas before the transaction is acknowledged. As a result, even if a minority of replicas fail, the latest log survives, achieving true RPO = 0 without relying on high‑end hardware.
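The majority‑persistence rule can be sketched as follows. This is an illustrative model, not OceanBase's actual implementation; the `replica.persist` method and the sequential loop are assumptions for clarity.

```python
def replicate_log_entry(entry: bytes, replicas: list) -> bool:
    """Return True once a majority of replicas confirm durable persistence.

    Illustrative sketch: a redo-log entry is acknowledged only after
    floor(n/2) + 1 replicas have durably written it, so the entry
    survives the failure of any minority of replicas.
    """
    acks = 0
    majority = len(replicas) // 2 + 1
    for replica in replicas:
        try:
            if replica.persist(entry):  # replica fsyncs before acking
                acks += 1
        except ConnectionError:
            continue  # a minority of failed replicas is tolerated
        if acks >= majority:
            return True  # safe to commit: the log cannot be lost
    return False  # no majority reached; the transaction must not commit
```

With three replicas, one unreachable replica still allows a commit, while two unreachable replicas block it, which is exactly the minority/majority boundary the article describes.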

OceanBase also augments data integrity with multiple verification steps: (1) redo‑log entries carry checksum information to detect silent disk errors; (2) data pages on storage are similarly checksummed; (3) periodic checkpoint consistency checks compare data across replicas; (4) index‑table consistency checks verify logical relationships; (5) background tasks regularly scan verification data to proactively report silent errors.
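The first two verification steps amount to framing each log entry (or data page) with a checksum and re‑verifying it on every read. A minimal sketch, using CRC32 as a stand‑in (the article does not specify which checksum OceanBase uses):

```python
import zlib

def frame_log_entry(payload: bytes) -> bytes:
    """Prepend a CRC32 checksum so silent disk corruption is detectable."""
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload

def verify_log_entry(frame: bytes) -> bytes:
    """Return the payload, or raise if the stored checksum no longer matches.

    A mismatch means the bytes changed after being written -- a silent
    disk error -- even though the disk reported the write as successful.
    """
    stored = int.from_bytes(frame[:4], "big")
    payload = frame[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("silent disk error detected: checksum mismatch")
    return payload
```

The same framing idea applies to data pages on storage; the background scan the article mentions simply re‑runs the verify step over cold data so corruption is reported before the page is ever needed.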

For service availability, OceanBase leverages Paxos leader election: only a majority of replicas can elect a leader, which prevents split‑brain scenarios. When the current leader fails, the remaining followers automatically trigger a new election, eliminating manual fail‑over and reducing RTO to seconds. The system uses a 10‑second heartbeat interval and requires several consecutive missed heartbeats before declaring a node down, minimizing false alarms.
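The heartbeat logic can be sketched as a small failure detector. The 10‑second interval comes from the article; the miss threshold of 3 and all class/method names are assumptions for illustration:

```python
import time

HEARTBEAT_INTERVAL = 10  # seconds, per the article
MISS_THRESHOLD = 3       # consecutive misses before declaring down (assumed)

class FailureDetector:
    def __init__(self):
        self.last_seen = {}   # node -> timestamp of last heartbeat
        self.misses = {}      # node -> consecutive missed heartbeats

    def on_heartbeat(self, node, now=None):
        """Record a heartbeat; any heartbeat resets the miss counter."""
        self.last_seen[node] = now if now is not None else time.monotonic()
        self.misses[node] = 0

    def check(self, node, now=None):
        """Return True only after several consecutive missed heartbeats."""
        now = now if now is not None else time.monotonic()
        if now - self.last_seen.get(node, now) > HEARTBEAT_INTERVAL:
            self.misses[node] = self.misses.get(node, 0) + 1
            self.last_seen[node] = now  # start timing the next interval
        return self.misses.get(node, 0) >= MISS_THRESHOLD
```

Requiring consecutive misses is what keeps a single dropped packet or GC pause from triggering an unnecessary leader election.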

Fault isolation is further refined by running Paxos groups at the partition level, so a failure in one partition does not affect others on the same machine. Special handling also allows temporarily excluding a flaky node from leader elections.
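Partition‑level Paxos groups can be illustrated as follows: each partition owns its own replica group and leader, so a node failure forces re‑election only in the partitions that node was leading, and a flaky node on a blacklist is skipped as a candidate. All names and structures here are assumptions, not OceanBase internals:

```python
class PartitionGroup:
    """One Paxos group per partition; each elects its own leader."""

    def __init__(self, partition_id, replicas):
        self.partition_id = partition_id
        self.replicas = replicas
        self.leader = replicas[0]

    def elect_leader(self, down, blacklist):
        """Pick a new leader, skipping failed and temporarily excluded nodes."""
        candidates = [r for r in self.replicas
                      if r not in down and r not in blacklist]
        self.leader = candidates[0] if candidates else None

def handle_node_failure(groups, failed, blacklist=frozenset()):
    """Re-elect only in partitions whose leader ran on the failed node;
    all other partitions on the same machines are unaffected."""
    for g in groups:
        if g.leader == failed:
            g.elect_leader({failed}, blacklist)
```

This mirrors the isolation property in the text: losing one node disturbs only the Paxos groups it led, while every other partition keeps serving without interruption.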

In practice, OceanBase can detect a node failure within about 10 seconds and restore service within 10‑30 seconds, though exact times depend on the number of affected machines, partition count, and failure type.

The article concludes that after extensive real‑world testing in Ant Financial’s systems (e.g., Alipay and NetBank), OceanBase’s combination of Paxos‑based consensus, comprehensive data verification, and automated fail‑over provides a robust solution that surpasses traditional single‑point databases in both data safety and service stability.

High Availability · Distributed Database · Fault Tolerance · Data Reliability · Paxos · OceanBase
Written by

AntTech

Technology is the core driver of Ant's future creation.
