
What Really Caused Blizzard’s Hearthstone Outage? Inside Oracle RAC Failure

An in‑depth look at Blizzard’s Hearthstone outage reveals that a power interruption likely triggered storage failures in an Oracle RAC cluster, leading to corrupted blocks, incomplete recovery, and a reliance on backups, while highlighting broader lessons on database backup, disaster recovery, and operational decision‑making.


Blizzard Hearthstone Outage Analysis

Recent statements from Blizzard and NetEase blamed a power interruption that caused data corruption, sparking much speculation. The incident provides a valuable case study in handling complex failures.

The core database architecture is not MySQL as rumored, but an Oracle RAC cluster with ASM storage, likely running version 12.1.0.2 on Linux, using GoldenGate for replication. This is inferred from a DBA Lead job posting that requires deep knowledge of Oracle RAC, ASM, GoldenGate, and Linux scripting.

Database: Oracle
Architecture: RAC + ASM
Version: 12.1.0.2 (estimated)
Nodes: 4 (estimated)
OS: Linux
Replication: GoldenGate

Key timeline (as reconstructed from public information):

Jan 14 15:20 – A power issue causes database corruption. The DBA begins repairs but discovers the backup is also corrupted. The database keeps running with damage while online repair proceeds.
Jan 17 01:00 – A planned maintenance window for the repair begins (expected to take 8 hours), but the repair is not completed.
Jan 18 18:00 – Blizzard announces a data rollback to Jan 14 15:20 and restores service.

Analysis of the failure points:

1. The fault occurred some time before the public announcement, which is typical practice.
2. A single-point power failure is unlikely in a mature IT environment.
3. The database likely suffered corrupted (bad) blocks; Oracle can keep running despite such damage, which prolongs and compounds secondary failures.
4. Blizzard apparently lacked an active Data Guard disaster-recovery setup; the statement mentions a "backup database" rather than a standby.
5. GoldenGate replication could not be used for recovery, possibly because the redo or undo logs were themselves corrupted.
6. Ultimately, unrecoverable bad blocks forced a point-in-time restore, sacrificing the transactions committed after the restore point.
7. The damage was limited to a small range of blocks, which suggests storage-related write loss.
8. The database is estimated at roughly 10 TB, based on the 8-hour recovery window.
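The size estimate and the cost of the rollback can be sanity-checked with back-of-envelope arithmetic. The ~350 MB/s restore throughput below is an assumption for illustration (not a figure from Blizzard), and the year is not stated in the article, so 2017 is assumed here only to make the datetimes valid:

```python
from datetime import datetime

# Assumed sustained restore throughput, MB/s (illustrative, not confirmed).
RESTORE_MBPS = 350
window_hours = 8  # stated repair/recovery window

# How much data could plausibly be restored in the stated window.
restored_tb = RESTORE_MBPS * 3600 * window_hours / 1_000_000
print(f"Restorable in {window_hours} h at {RESTORE_MBPS} MB/s: ~{restored_tb:.1f} TB")

# Data-loss window implied by rolling back to the moment of corruption.
corruption = datetime(2017, 1, 14, 15, 20)  # rollback target (year assumed)
restored_at = datetime(2017, 1, 18, 18, 0)  # service restoration announcement
lost = restored_at - corruption
print(f"Window between rollback point and restoration: {lost.days} days {lost.seconds // 3600} h")
```

At the assumed rate, 8 hours corresponds to roughly 10 TB, which is consistent with point 8; the gap between the rollback point and restoration spans about four days of player activity.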

The root cause is hypothesized to be storage failure causing write loss in the RAC cluster, leading to an incomplete recovery.
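A "lost write" means storage acknowledged a write that never actually reached disk, leaving a stale block behind. A toy model of how such a block can be detected, loosely inspired by Oracle's comparison of block versions (SCNs) against the redo stream (the structures here are simplified illustrations, not Oracle's on-disk format):

```python
def detect_lost_write(redo_scn: int, block_scn: int) -> bool:
    """The redo stream says the block was last written at redo_scn; the
    block on disk carries block_scn. If the disk copy is older than what
    redo proves was written, the storage layer lost the write."""
    return block_scn < redo_scn

# A stale block (SCN 100) when redo proves a write at SCN 105 → lost write.
assert detect_lost_write(redo_scn=105, block_scn=100)
# Matching versions → no lost write detected.
assert not detect_lost_write(redo_scn=105, block_scn=105)
```

In a real Oracle system this cross-check only helps if a standby or valid backup holds the correct block version, which circles back to why the missing Data Guard setup mattered.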

DBA First Rule: Backup Over Everything

Without reliable backups, recovery is impossible. The incident underscores that Blizzard’s eventual rescue relied on a backup to roll back to Jan 14.

A 2016 Oracle database operations report for China showed that fewer than 20% of databases had complete RMAN backups, and only about 24% operated in archivelog mode.

Key lessons for operations teams:

Maintain effective backup and disaster‑recovery mechanisms; backups must be tested for speed and reliability.
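The point above is that an untested backup is not a backup; Blizzard discovered theirs was corrupted only mid-incident. A minimal sketch of automated backup verification, assuming backups are files with a recorded SHA-256 checksum (the paths, names, and thresholds here are hypothetical, and a real Oracle setup would use RMAN's own validation instead):

```python
import hashlib
import os
import tempfile
from datetime import datetime, timedelta

def verify_backup(path, expected_sha256, taken_at, max_age=timedelta(days=1)):
    """Return (ok, reason): backup must be recent and match its checksum."""
    if datetime.now() - taken_at > max_age:
        return False, "backup older than allowed RPO"
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        return False, "checksum mismatch: backup corrupted"
    return True, "ok"

# Hypothetical usage: write a dummy "backup", record its checksum, verify.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"fake backup contents")
    backup_path = f.name
recorded = hashlib.sha256(b"fake backup contents").hexdigest()
ok, reason = verify_backup(backup_path, recorded, taken_at=datetime.now())
bad, why = verify_backup(backup_path, "0" * 64, taken_at=datetime.now())
os.unlink(backup_path)
```

Checksum and freshness checks only prove the file is intact; periodically restoring the backup onto scratch hardware is the only way to also verify recovery speed.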

Define clear policies on whether to prioritize availability (A) or consistency (C) during incidents.

Implement robust fault‑handling processes and end‑to‑end emergency procedures.

Ensure rapid collaboration across internal and external teams during prolonged incidents.

In summary, the Blizzard outage illustrates the critical importance of proper database architecture, backup strategy, and operational decision‑making in preventing and mitigating large‑scale service disruptions.

Tags: disaster recovery, backup, Oracle, Database Operations, RAC
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
