Can't migrate, backup, copy VMs

I've been having issues with a node so I added another one, made a cluster and figured I'd move from old to new.

Each node is a 6-core/12-thread NUC with 64GB RAM; one has 3x 2TB NVMe in RAIDZ-1, the 'bad' node has 2x 2TB NVMe in a ZFS mirror.
Both have 2x 1GbE and a 2-port PCIe 10GbE card (Intel).
Management is on 1GbE, all VMs on 10GbE.

It starts copying, lasts around 5 minutes and then-

This crashes the old node, the cluster, everything, and it needs a manual poke to reboot.
Originally I thought a ZFS error might be responsible (due to an earlier crash), but now I'm not sure.

I've also tried shutting down a VM and migrating it cold: same issue.
I tried using gdisk to fix the partition error: same issue.
I tried setting up a Proxmox Backup Server, but haven't got it working yet.

Here are the results of my latest attempt at cold migration-

2022-05-06 23:18:10 23:18:10   29.0G   rpool/data/vm-101-disk-1@__migration__
2022-05-06 23:18:11 warning: cannot send 'rpool/data/vm-101-disk-1@__migration__': Input/output error
2022-05-06 23:18:12 cannot receive new filesystem stream: checksum mismatch
2022-05-06 23:18:12 cannot open 'rpool/data/vm-101-disk-3': dataset does not exist
2022-05-06 23:18:12 command 'zfs recv -F -- rpool/data/vm-101-disk-3' failed: exit code 1
send/receive failed, cleaning up snapshot(s)..
2022-05-06 23:18:12 ERROR: storage migration for 'local-zfs:vm-101-disk-1' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve02' root@172.20.1.20 -- pvesm import local-zfs:vm-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 1
2022-05-06 23:18:12 aborting phase 1 - cleanup resources
2022-05-06 23:18:13 ERROR: migration aborted (duration 00:04:36): storage migration for 'local-zfs:vm-101-disk-1' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve02' root@172.20.1.20 -- pvesm import local-zfs:vm-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 1
TASK ERROR: migration aborted

I'm going nuts. Where do I start looking for clues, please?
 
Check the syslog and dmesg for any disk errors.
Does the pool show any errors on either side? Check with zpool status.

If it crashes a host, then there's most likely something in the syslog and the dmesg output.
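
For example, a minimal first pass on each node could look like this (the grep patterns are only illustrative; adjust them to whatever your hardware actually logs):

# kernel messages from the current boot - look for NVMe resets or I/O errors
dmesg -T | grep -iE 'nvme|i/o error|blk_update'
# pool health, including any files flagged as permanently corrupted
zpool status -v rpool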
 
Thanks for the reply @mira.
zpool status shows some errors:


root@pve01:~# zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:08:23 with 8 errors on Sat May 7 21:04:25 2022
config:

        NAME                                                 STATE     READ WRITE CKSUM
        rpool                                                DEGRADED     0     0     0
          mirror-0                                           DEGRADED     0     0     0
            nvme-eui.0000000001000000e4d25cb59df55201-part3  DEGRADED     0     0    32  too many errors
            nvme-eui.0000000001000000e4d25cb39df55201-part3  DEGRADED     0     0    32  too many errors

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-101-disk-1:<0x1>
        rpool/data/vm-103-disk-0:<0x1>

I've scrubbed the pool, but these errors on the two VM disks remain.
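
For anyone following along: as I understand it, entries only drop off that permanent-errors list once the affected data has been removed or restored and the pool has been scrubbed again, so a scrub on its own won't clear them. The re-check cycle I've been using after each change is roughly:

zpool scrub rpool
zpool status -v rpool    # once the scrub finishes, see whether the error list has shrunk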

Right now I don't massively care about vm-101; I can back up the config and rebuild it pretty easily.
Using gdisk I was able to (temporarily?) fix vm-101, but it still wouldn't copy anywhere.

However, vm-103 is a cPanel server and my last backup is six days old.
It won't boot any longer because I tried the same trick and borked the boot partition.
I'm currently working on that: the CentOS rescue boot has no network, and it needs one because the grub drivers aren't on the rescue disk...

But if I can boot vm-103, I can migrate the accounts to the new node, where I've installed a new cPanel server...

Key goals:
1. Any method to copy the VMs off that node.
2. How to fix the boot partition on the CentOS cPanel VM (I'm halfway there - see the sketch below of what I'm attempting).
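
For goal 2, what I'm attempting from the CentOS rescue environment is roughly the standard grub reinstall (just a sketch; it assumes a BIOS/MBR install, that rescue mode mounts the system under /mnt/sysimage, and that the disk shows up as /dev/vda inside the VM):

# from the CentOS rescue shell
chroot /mnt/sysimage
grub2-install /dev/vda
grub2-mkconfig -o /boot/grub2/grub.cfg
exit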
 
As those are `ZVols`, you should be able to copy them the same as any other data on a block device. One tool to do that is dd (see the example below).
Although there's no guarantee that the data is still intact.
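
A rough example of what that could look like (the paths are only illustrative: the zvol device nodes live under /dev/zvol/, the target zvol has to exist on the other node first with at least the same size, and conv=noerror,sync will silently zero-fill any unreadable blocks):

# stream the raw zvol to the other node over SSH
dd if=/dev/zvol/rpool/data/vm-103-disk-0 bs=1M status=progress conv=noerror,sync \
  | ssh root@172.20.1.20 'dd of=/dev/zvol/rpool/data/vm-103-disk-0 bs=1M'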

Which NVMes are you using? (Vendor + Model)
 
Hi @mira,
The drives are:
Vendor: Intel
Model: 665p

I'm attaching pics of the drives, and also of where the boot process stopped. Please let me know if this error is familiar, or fixable.
 

Attachments

  • IMG_8058.jpg (94.8 KB)
  • IMG_8061.jpg (82.6 KB)
Thanks @mira. The node did have significant BIOS firmware updates to do (from v34 to v69 or something); I didn't think about the storage.

Is there any way to figure out the boot issue, and/or any way to poke around the filesystem to find out what happened?

I no longer care about the data, but I would feel happier if I could figure out whether I made a mistake that I can avoid in future.
 
Usually both the journal (journalctl) and dmesg contain any errors that are logged.

Is the VM with the corrupted disk (rpool/data/vm-103-disk-0:<0x1>) the one with boot issues?
Depending on the corruption, it might not be possible to fix it.
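
Since the node crashed and was rebooted, the previous boot's journal is usually the interesting one (this assumes persistent journalling is enabled, i.e. /var/log/journal exists):

journalctl -b -1 -p err                      # errors from the boot before the crash
journalctl -b -1 -k | grep -iE 'nvme|zfs'    # kernel messages from that boot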
 
Hi @mira, yes. Over the weekend I was able to repair the boot partition of vm-103, and it was working on Monday.

I attempted to transfer the VM to the new node on Monday night and the node crashed.
I rebooted the node using a remote power switch; I guess that's probably what killed it, because it never booted again...

Thanks, I'll see if I can find anything in the logs. All of the symptoms of this issue were disk related, but I'm worried that I missed something, as there was nothing in SMART and 'zpool scrub' seemed OK.

I spent a lot of time trying to figure this out, but there's not much out there on fixing/managing/recovering a ZFS mirror.
 
Yes, if all disks are affected, which is the case here, then it basically boils down to 'restore from backup'.
 
OK, thank you.

Just to close this off for future readers: the node would not boot with only one disk in it; I tried both individually.
If only one disk were damaged, we'd expect the good one to boot... I hope!
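
If anyone else ends up here: it may be worth checking in advance that both mirror members actually have a bootloader installed. On recent Proxmox VE installs proxmox-boot-tool can report which ESPs are configured and what is on them, e.g.:

proxmox-boot-tool status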
 
