What Triggered These Real‑World System Crashes? 13 Post‑Mortem Lessons
The article compiles thirteen post‑mortem case studies of severe system outages—from AIX NTP misconfiguration and backup appliance driver issues to PowerHA node ID conflicts and hardware failures—detailing symptoms, root‑cause analysis, and practical remediation steps for each incident.
01. AIX NTP misconfiguration caused multiple cluster crashes
A friend reported that three Oracle RAC clusters on AIX machines rebooted simultaneously after a hardware relocation. Investigation revealed that all clusters shared the same NTP server, but one used xntpd while the others ran ntpdate via cron. The ntpdate jobs caused large time steps, which made the cssd process trigger a system reboot. Lesson: prefer the xntpd service for time synchronization instead of periodic ntpdate calls.
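On AIX, the switch from cron-driven ntpdate to the xntpd daemon might look like the following sketch. The NTP server address and the crontab pattern are placeholders; adjust them to your environment.

```shell
# Remove the periodic ntpdate job from root's crontab
# (the grep pattern is a placeholder; match your actual entry).
crontab -l | grep -v 'ntpdate' | crontab -

# Point xntpd at the site NTP server (placeholder address).
echo "server 10.0.0.1" >> /etc/ntp.conf

# Start xntpd now and enable it at boot via /etc/rc.tcpip.
chrctcp -S -a xntpd
```

The point of the change: xntpd slews small offsets gradually rather than stepping the clock, so cssd never observes the sudden time jump that a periodic ntpdate run can produce.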
02. Backup appliance CDP driver caused a crash
During testing of an AIX backup appliance, the CDP driver was left installed after the client was removed. Upon reboot the system failed to start. The vendor confirmed that the CDP driver must be removed before uninstalling the client.
03. LVM mirror expansion error led to data loss
In a dual‑node, dual‑storage HA setup, expanding a filesystem by adding disks directly to the VG caused data to be unevenly distributed across the two storage arrays. When one storage failed, the system lost data integrity. Lesson: When using LVM mirrors, expand the logical volume first, then the filesystem.
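A sketch of the correct expansion order on AIX, assuming a mirrored JFS2 setup with one new disk contributed by each storage array (all hdisk names, LV/VG names, and sizes below are placeholders):

```shell
# Add one new disk from each storage array to the volume group.
extendvg datavg hdisk4 hdisk5

# Grow the mirrored logical volume first; with a strict allocation
# policy, AIX keeps each mirror copy on separate disks.
extendlv oralv 100

# Only then grow the filesystem on top of it (+10 GB here).
chfs -a size=+10G /oradata
```

Extending the VG and letting new data land wherever free space exists is what split the data unevenly across the two arrays in this incident; growing the LV before the filesystem keeps both mirror copies complete.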
04. HACMP node‑ID duplication caused cluster halt
Three PowerHA XD clusters shared identical RSCT node UUIDs after an alt_disk_copy performed without the -B -C -O options. The duplicate IDs caused quorum loss and a complete halt. The fix involved stopping HA services, regenerating the RSCT node configuration, and rebooting all nodes.
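The recovery described above might be sketched as follows; recfgct and lsnodeid are the standard RSCT utilities for resetting and inspecting a node's identity, but verify the exact paths on your RSCT level before running anything.

```shell
# On each affected node, with cluster services already stopped:
# regenerate the RSCT node ID (this resets the node's RSCT config).
/usr/sbin/rsct/install/bin/recfgct

# Verify the node IDs now differ across all nodes.
/usr/sbin/rsct/bin/lsnodeid

# Reboot so PowerHA picks up the new identity.
shutdown -Fr
```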
05. Power 570/595 crash due to improper CDP driver removal
After uninstalling the backup client but leaving the CDP driver, the Power 595 failed to boot. The vendor required the CDP driver to be removed first.
06. ERP backup triggered HACMP crash
During a backup window, the haemd daemon repeatedly restarted, causing the Oracle database to stop. The issue stemmed from excessive I/O and insufficient filesystem cache, and was mitigated by tuning the maxpout and minpout disk I/O pacing parameters.
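On AIX, maxpout and minpout are attributes of the sys0 device. A minimal sketch of inspecting and tuning them; the values shown are illustrative, not a recommendation for this workload:

```shell
# Inspect current I/O pacing settings (0 means pacing is disabled).
lsattr -El sys0 -a maxpout -a minpout

# Enable pacing: a process writing to a file is suspended once it has
# maxpout pending pages and resumed at minpout. Tune per workload.
chdev -l sys0 -a maxpout=8193 -a minpout=4096
```

Pacing throttles the backup's burst writes so that daemons such as haemd are not starved of I/O during the backup window.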
07. WebLogic memory‑leak crash investigation
Repeated out-of-memory errors were traced to non-heap (permanent generation) memory exhaustion. Adjusting PermSize in setDomainEnv.sh had no effect because JAVA_VENDOR was set to N/A, so the vendor-specific memory settings were never applied. The final fix set a proper JAVA_VENDOR and added explicit memory arguments (-Xms2048m -Xmx2048m -XX:PermSize=1024m).
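The fix might look like the following fragment of setDomainEnv.sh; JAVA_VENDOR and MEM_ARGS are the standard WebLogic environment variables, and the vendor value "Oracle" is an assumption for a HotSpot JVM.

```shell
# Identify the JVM vendor so WebLogic's vendor-specific
# branches (and MEM_ARGS) actually take effect.
JAVA_VENDOR="Oracle"
export JAVA_VENDOR

# Set heap and permanent-generation sizes explicitly.
MEM_ARGS="-Xms2048m -Xmx2048m -XX:PermSize=1024m"
export MEM_ARGS
```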
08. P550/P570 HA crash and data loss
Power failure left both UPS units partially powered, causing both P550 nodes to shut down. After hardware replacement and manual IP aliasing, the HA cluster was restored, though some data under /orafile was lost and later recovered from backup.
09. AIX 6100‑06‑06 bug causing kernel panic
The netstat -f unix command triggered a kernel panic due to a file-lock bug (APAR IV09793). The recommended fix is to apply the bos.mp64 patch or upgrade to level 6100-06-12-1339 (SP12).
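Before touching an affected box, it is worth confirming whether the fix is already in place; both commands below are standard AIX tools.

```shell
# Show the current technology level / service pack.
oslevel -s

# Check whether the APAR fix is installed (prints "All filesets
# found" when the fix is present).
instfix -ik IV09793
```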
10. PowerHA node‑ID conflict during IP switch
When all IP networks were lost but a non-IP network remained, PowerHA 6 dumped core (APAR IV55293). Upgrading the RSCT fileset resolved the issue.
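A quick way to verify the upgrade took effect, using standard AIX fileset tooling:

```shell
# List installed RSCT fileset levels before and after the update.
lslpp -l "rsct.*"

# Confirm the APAR is covered by the installed level.
instfix -ik IV55293
```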
11. Power 595 crash caused by I/O cabinet power loss
During a routine power-supply swap on an I/O cabinet, an unexpected power drop caused the Power 595 to crash. Replacing the I/O cabinet's DCA (its power-supply assembly) resolved the problem.
12. X86 server crash due to faulty optical drive
An IBM X3650 running SUSE 9 hung because a defective CD/DVD drive caused kernel panics. Replacing the drive restored stability.
13. Miscellaneous hardware‑related crashes
Additional incidents include UPS failures, firmware errors, and component replacements that led to temporary outages but were resolved through hardware swaps and firmware updates.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and hope to accompany you throughout your operations career.