
Why Does an Oracle RAC Node Keep Rebooting? Time Sync, Heartbeats, and Hardware Fixes

This article details a step‑by‑step investigation of a two‑node Oracle RAC environment where node 2 repeatedly rebooted, covering time‑synchronization checks, database and network heartbeat analysis, hardware diagnostics, and the final resolution through firmware and memory replacement.

Efficient Ops

Environment

Two physical servers (model R680) were configured as a 2‑node Oracle RAC cluster running Oracle 11.2.0.4.

1. Symptoms

Node 2 experienced frequent reboots from January to February, sometimes three times in a single day.

2. Analysis and Troubleshooting Process

1) Time Synchronization Check

The initial suspicion was time drift. The observed NTP offset was 11376 seconds (over three hours), and the CTSS logs contained abnormal return values. The NTP sources were switched from the old servers (10.33.144.18/19) to the new data‑center servers (11.8.13.1/9), and the BIOS clock was aligned with the system clock, which eliminated the time‑sync issue.
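An offset check of this kind can be automated. The following is a minimal Python sketch, not the procedure used in this incident: it parses text in the column layout printed by `ntpq -pn` (where the offset column is in milliseconds) and flags peers whose offset exceeds a threshold. The function names, the 1‑second threshold, and the sample rows are assumptions made for illustration.

```python
from __future__ import annotations

# Invented sample in the layout of `ntpq -pn`; the values are NOT from the incident.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*11.8.13.1       .GPS.            1 u   33   64  377    0.412    0.035   0.021
 11.8.13.9       11.8.13.1        2 u   12   64  377    0.398 11376000.000   0.500
"""

TALLY_CODES = "*+-#ox."  # leading peer-status characters printed by ntpq


def parse_ntpq_offsets(ntpq_output: str) -> dict[str, float]:
    """Return {peer_address: offset_in_seconds} from `ntpq -pn`-style text."""
    offsets: dict[str, float] = {}
    for line in ntpq_output.splitlines():
        fields = line.split()
        # Peer rows have 10 columns; skip the header and the ==== separator.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        remote = fields[0]
        # With -n the peer is a numeric address, so a leading tally code is safe to strip.
        if remote[0] in TALLY_CODES:
            remote = remote[1:]
        try:
            offsets[remote] = float(fields[8]) / 1000.0  # ms -> s
        except ValueError:
            continue  # ignore rows whose offset column is not numeric
    return offsets


def find_drifted(offsets: dict[str, float], max_abs_offset_s: float = 1.0) -> list[str]:
    """Return the peers whose absolute offset exceeds the threshold, sorted."""
    return sorted(p for p, off in offsets.items() if abs(off) > max_abs_offset_s)
```

Calling `find_drifted(parse_ntpq_offsets(SAMPLE))` on the invented sample flags only the second peer. In practice the input text would come from running `ntpq -pn` on each node, and the threshold would be chosen to match the cluster's tolerance.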

2) Database Log Examination

The alert logs showed node eviction events, and the CSSD logs showed that the disk heartbeat was still present while the network heartbeat was missing, which pointed to a problem on the private network.

In RAC, a node is evicted when its heartbeats are missed for longer than the configured threshold: 30 s by default for the network heartbeat (misscount) and 200 s for the disk heartbeat. In a two‑node cluster, a split brain leaves two sub‑clusters of equal size, and the one containing the lower‑numbered node survives, which is consistent with node 2 being the node that rebooted.
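The timeout and survivor rules above can be written as a small model. This is a deliberately simplified Python sketch for reasoning about the behaviour, not Oracle Clusterware's actual implementation; the class, function names, and timestamps are hypothetical, and the "larger sub‑cluster wins, ties go to the lowest node number" rule is the commonly described 11.2 behaviour, assumed here.

```python
from __future__ import annotations

from dataclasses import dataclass

MISSCOUNT_S = 30     # default network heartbeat threshold cited in the text
DISKTIMEOUT_S = 200  # default disk heartbeat threshold cited in the text


@dataclass
class NodeHeartbeat:
    node_number: int
    last_network_hb: float  # time of last network heartbeat seen, in seconds
    last_disk_hb: float     # time of last disk heartbeat seen, in seconds


def eviction_reason(hb: NodeHeartbeat, now: float) -> str | None:
    """Return why a node would be evicted at time `now`, or None if it is healthy."""
    if now - hb.last_network_hb >= MISSCOUNT_S:
        return "network heartbeat missed"
    if now - hb.last_disk_hb >= DISKTIMEOUT_S:
        return "disk heartbeat missed"
    return None


def split_brain_survivor(subclusters: list[set[int]]) -> set[int]:
    """Pick the surviving sub-cluster: the largest, ties broken by lowest node number."""
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))
```

For the pattern seen in this incident (disk heartbeat present, network heartbeat missing), `eviction_reason(NodeHeartbeat(2, last_network_hb=0.0, last_disk_hb=29.0), now=30.0)` reports a missed network heartbeat, and `split_brain_survivor([{1}, {2}])` keeps node 1.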

3) Network Investigation

The RAC heartbeat network used two NICs (ETH13 and ETH15), each connected to a separate switch. The team re‑enabled failed switch ports and NICs, swapped cables, and isolated the links one at a time. This revealed significant optical loss on the links, but the reboots persisted.
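When isolating links like this, it helps to compare NIC error counters before and after each change. The sketch below is one possible way to do that in Python on Linux, reading the standard per‑interface counters under `/sys/class/net/<iface>/statistics`. The helper names and the choice of counters are assumptions, and the lowercase interface names `eth13` and `eth15` used in the usage note are a guess at how the article's NICs appear on the host.

```python
from __future__ import annotations

from pathlib import Path

# Counters that tend to move when a link is physically degraded.
COUNTERS = ("rx_errors", "rx_crc_errors", "rx_dropped", "tx_errors")


def read_nic_counters(iface: str, sysfs_root: str = "/sys/class/net") -> dict[str, int | None]:
    """Read selected counters for one interface; unreadable counters become None."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    counters: dict[str, int | None] = {}
    for name in COUNTERS:
        try:
            counters[name] = int((stats_dir / name).read_text().strip())
        except (OSError, ValueError):
            counters[name] = None
    return counters


def counter_deltas(before: dict[str, int | None], after: dict[str, int | None]) -> dict[str, int]:
    """Return after - before for every counter that was readable in both snapshots."""
    return {
        name: after[name] - before[name]
        for name in before
        if before[name] is not None and after.get(name) is not None
    }
```

A typical use would be to call `read_nic_counters("eth13")` and `read_nic_counters("eth15")`, wait through a test window, read again, and pass both snapshots to `counter_deltas`. Growing CRC or receive errors on one interface would point at that link's optics or cabling.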

4) Hardware Investigation

Since network and database checks were inconclusive, hardware was examined. MCELOG logs (Machine Check Exception logs) were reviewed for CPU and memory errors. The logs indicated memory‑related hardware errors, which can cause server reboots.
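Reviewing machine‑check logs by hand is tedious, so a quick summary can help decide whether memory is implicated. The following Python sketch assumes the text layout that `mcelog` commonly writes, where each record starts with a `Hardware event.` line and includes a `CPU <n> BANK <m>` line; the exact wording varies by mcelog version and CPU. The sample text is invented to match that shape and is not taken from the incident, and the keyword list is an assumption.

```python
from __future__ import annotations

import re
from collections import Counter

# Invented sample shaped like mcelog output; NOT from the incident.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 9
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 1
MCA: Generic CACHE Level-2 Generic Error
"""

# Upper-cased phrases treated as signs of a memory-related event (an assumption).
MEMORY_HINTS = ("MEMORY CONTROLLER", "MEMORY READ ERROR", "MEMORY WRITE ERROR")


def summarize_mcelog(text: str) -> dict:
    """Count machine-check records, memory-related records, and records per CPU."""
    # Everything before the first "Hardware event." line is preamble, so drop it.
    records = re.split(r"(?m)^Hardware event\..*$", text)[1:]
    per_cpu: Counter = Counter()
    memory_events = 0
    for record in records:
        cpu = re.search(r"(?m)^CPU (\d+)\b", record)
        if cpu:
            per_cpu[int(cpu.group(1))] += 1
        upper = record.upper()
        if any(hint in upper for hint in MEMORY_HINTS):
            memory_events += 1
    return {"total": len(records), "memory": memory_events, "per_cpu": dict(per_cpu)}
```

On the invented sample, `summarize_mcelog(SAMPLE)` reports two records, one of them memory‑related. A nonzero and growing memory count on a real log would be a reason to involve the hardware vendor.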

After consulting the hardware vendor, the motherboard firmware was updated and a faulty memory module was replaced, resolving the reboot problem.

3. Summary and Reflections

1) Monitoring is crucial; the hardware issue was not detected by the existing monitoring platform.

2) A comprehensive investigation covering logs, network, database, system, and hardware eventually uncovered the root cause.

3) Patience and meticulousness are essential; systematic troubleshooting leads to resolution.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
