Recovering VMs After a FusionCompute CNA Node Crash: A Step-by-Step Ops Guide
After a FusionCompute CNA node crashed and took the VRM management platform down with it, the author recovered the lost virtual machines by locating their disk images, working around a broken OVS flow table, copying data out with scp (strict host-key checking disabled), and recreating the VMs, and shares the key lessons for reliable operations.
Yesterday a FusionCompute CNA node failed suddenly; after a reboot it could not start, and the VRM management platform, which was deployed on that node without an active-passive configuration, became unusable as well. The node and VRM were reinstalled, but after the previously healthy node was added to the new VRM, all VMs disappeared from <code>virsh list --all</code>, although the storage remained intact.
Storage Data Recovery
Inspecting the system revealed that the data resides in /POME/datastore_1/vol, where files are named by disk ID. Comparing file sizes against the disk specifications of the known VMs, and counting the files, the totals matched the VMs and templates registered in VRM.
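This matching step can be sketched with standard tools. The datastore path comes from the incident, but the snippet below builds a temporary mock layout instead of touching it, and the volume names are invented (the real ones are UUID-named, e.g. vol_fb5b2975-...):

```shell
#!/bin/sh
# Mock a datastore layout like /POME/datastore_1/vol for illustration.
base=$(mktemp -d)
mkdir -p "$base/vol_aaaa" "$base/vol_bbbb"
truncate -s 40G "$base/vol_aaaa/vol_aaaa"   # sparse stand-in for a 40 GB system disk
truncate -s 20G "$base/vol_bbbb/vol_bbbb"   # sparse stand-in for a 20 GB data disk

# Apparent sizes, to compare against the disk specs of known VMs
du -h --apparent-size "$base"/vol_*/vol_*

# Volume count, to check against the VM + template count shown in VRM
count=$(ls -d "$base"/vol_* | wc -l)
echo "volumes: $count"
```

Sparse files make the mock cheap: <code>du --apparent-size</code> reports the declared size rather than the blocks actually allocated, which is what you want when matching images to VM specifications.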
The plan was to create a new VM of the same specification and overwrite its disk file with the recovered one. After the node was added to VRM, however, an error occurred, and rebooting the node cost it network connectivity: the distributed virtual switch (OVS) was missing an output flow and two ports.
Being unfamiliar with OVS, the author shifted focus to data recovery. Copying the data to a USB drive failed because the NTFS driver was missing, and the node had no network access. Connecting the node directly to another server and running scp triggered a strict host-key verification error:
<code>No ECDSA host key is known for 192.168.1.1 and you have requested strict checking.
Host key verification failed.
lost connection</code>
Adding the <code>-o StrictHostKeyChecking=no</code> option allowed the transfer:
<code>scp -r -o StrictHostKeyChecking=no /POME/datastore_1/vol/vol_fb5b2975-e6e8-41db-8675-10556bfa8df3/ 192.168.1.1:/home</code>
(The <code>-r</code> flag is needed here because the source is a directory.) A new VM was then created, the recovered disk file was renamed to the new VM's disk ID, and the new files were overwritten. Each volume directory contained three files: the disk image (named by its ID), <code>snapshot_list.cfg</code> (listing the disk ID), and a binary <code>Cnalockfile</code>. Absence of <code>Cnalockfile</code> indicated a template, while its presence marked an instantiated VM. After the replacement, the VM booted successfully with its original system.
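The classification and replacement steps can be sketched as follows. The directory names and the new disk ID below are stand-ins invented for the example; only the three-file layout (image, <code>snapshot_list.cfg</code>, <code>Cnalockfile</code>) comes from the incident:

```shell
#!/bin/sh
# Mock two volume directories (names invented) to illustrate the
# Cnalockfile check and the rename-and-overwrite step.
base=$(mktemp -d)
mkdir -p "$base/vol_recovered" "$base/vol_template"
touch "$base/vol_recovered/vol_recovered" \
      "$base/vol_recovered/snapshot_list.cfg" \
      "$base/vol_recovered/Cnalockfile"
touch "$base/vol_template/vol_template" \
      "$base/vol_template/snapshot_list.cfg"

kind_of() {
    # Presence of Cnalockfile marks an instantiated VM; absence, a template.
    if [ -f "$1/Cnalockfile" ]; then echo vm; else echo template; fi
}

kind_of "$base/vol_recovered"   # prints: vm
kind_of "$base/vol_template"    # prints: template

# Replacement step: rename the recovered image to the disk ID that VRM
# assigned to the new VM ("vol_new_disk_id" is a stand-in), overwriting
# the freshly created image.
new_id=vol_new_disk_id
mv -f "$base/vol_recovered/vol_recovered" "$base/vol_recovered/$new_id"
```

Checking for <code>Cnalockfile</code> before overwriting is a cheap safeguard against accidentally clobbering a template with a VM image, or vice versa.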
Lessons Learned
Configure VRM in active‑passive mode to avoid single‑point failures.
Copying a 40 GB VM image took over two hours, most likely because of a faulty disk; failing disks should be replaced promptly.
Stay calm, analyze the problem systematically, and extract actionable lessons from incidents.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.