Recovering VMs After a FusionCompute CNA Node Crash: A Step-by-Step Ops Guide
After a FusionCompute CNA node crashed and took the VRM management platform down with it, the author recovered the lost virtual machines by locating their disk images, working around a broken OVS flow table, copying data out with scp (strict host-key checking disabled), and recreating the VMs, and shares the key lessons for reliable operations.
Yesterday a FusionCompute CNA node failed suddenly; after a reboot it could not start, and the VRM management platform, which was deployed on that node without an active-passive configuration, became unusable as well. The node and VRM were reinstalled, but after the previously healthy node was added to the new VRM, all VMs disappeared from <code>virsh list --all</code>, although the storage remained intact.
Storage Data Recovery
Inspecting the system revealed that the data resides in /POME/datastore_1/vol, where files are named by disk ID. Comparing file sizes against the disk specifications of the known VMs, and counting the files, the totals matched the VMs and templates registered in VRM.
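This matching step can be sketched with standard tools. The datastore path comes from the incident, but the snippet below builds a temporary mock layout instead of touching it, and the volume names are invented (the real ones are UUID-named, e.g. vol_fb5b2975-...):

```shell
#!/bin/sh
# Mock a datastore layout like /POME/datastore_1/vol for illustration.
base=$(mktemp -d)
mkdir -p "$base/vol_aaaa" "$base/vol_bbbb"
truncate -s 40G "$base/vol_aaaa/vol_aaaa"   # sparse stand-in for a 40 GB system disk
truncate -s 20G "$base/vol_bbbb/vol_bbbb"   # sparse stand-in for a 20 GB data disk

# Apparent sizes, to compare against the disk specs of known VMs
du -h --apparent-size "$base"/vol_*/vol_*

# Volume count, to check against the VM + template count shown in VRM
count=$(ls -d "$base"/vol_* | wc -l)
echo "volumes: $count"
```

Sparse files make the mock cheap: <code>du --apparent-size</code> reports the declared size rather than the blocks actually allocated, which is what you want when matching images to VM specifications.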
The plan was to create a new VM of the same specification and overwrite its disk file with the recovered one. After the node was added to VRM, however, an error occurred, and rebooting the node cost it network connectivity: the distributed virtual switch (OVS) was missing an output flow and two ports.
Being unfamiliar with OVS, the author shifted focus to data recovery. Copying the data to a USB drive failed because the NTFS driver was missing, and the node had no network access. Connecting the node directly to another server and running scp triggered a strict host-key verification error:
<code>No ECDSA host key is known for 192.168.1.1 and you have requested strict checking.
Host key verification failed.
lost connection</code>
Adding the <code>-o StrictHostKeyChecking=no</code> option allowed the transfer:
<code>scp -r -o StrictHostKeyChecking=no /POME/datastore_1/vol/vol_fb5b2975-e6e8-41db-8675-10556bfa8df3/ 192.168.1.1:/home</code>
(The <code>-r</code> flag is needed here because the source is a directory.) A new VM was then created, the recovered disk file was renamed to the new VM's disk ID, and the new files were overwritten. Each volume directory contained three files: the disk image (named by its ID), <code>snapshot_list.cfg</code> (listing the disk ID), and a binary <code>Cnalockfile</code>. Absence of <code>Cnalockfile</code> indicated a template, while its presence marked an instantiated VM. After the replacement, the VM booted successfully with its original system.
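The classification and replacement steps can be sketched as follows. The directory names and the new disk ID below are stand-ins invented for the example; only the three-file layout (image, <code>snapshot_list.cfg</code>, <code>Cnalockfile</code>) comes from the incident:

```shell
#!/bin/sh
# Mock two volume directories (names invented) to illustrate the
# Cnalockfile check and the rename-and-overwrite step.
base=$(mktemp -d)
mkdir -p "$base/vol_recovered" "$base/vol_template"
touch "$base/vol_recovered/vol_recovered" \
      "$base/vol_recovered/snapshot_list.cfg" \
      "$base/vol_recovered/Cnalockfile"
touch "$base/vol_template/vol_template" \
      "$base/vol_template/snapshot_list.cfg"

kind_of() {
    # Presence of Cnalockfile marks an instantiated VM; absence, a template.
    if [ -f "$1/Cnalockfile" ]; then echo vm; else echo template; fi
}

kind_of "$base/vol_recovered"   # prints: vm
kind_of "$base/vol_template"    # prints: template

# Replacement step: rename the recovered image to the disk ID that VRM
# assigned to the new VM ("vol_new_disk_id" is a stand-in), overwriting
# the freshly created image.
new_id=vol_new_disk_id
mv -f "$base/vol_recovered/vol_recovered" "$base/vol_recovered/$new_id"
```

Checking for <code>Cnalockfile</code> before overwriting is a cheap safeguard against accidentally clobbering a template with a VM image, or vice versa.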
Lessons Learned
Configure VRM in active‑passive mode to avoid single‑point failures.
Copying a 40 GB VM image took over two hours, most likely because of a faulty disk; failing disks should be replaced promptly.
Stay calm, analyze the problem systematically, and extract actionable lessons from incidents.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.