Can't migrate, backup, copy VMs

aconnor · May 6, 2022

I've been having issues with a node so I added another one, made a cluster and figured I'd move from old to new.

Each node- 6 core/ 12 thread NUC, 64GB RAM, one has 3x 2TB NVMe in RAIDZ-1, the 'bad' node has 2x 2TB NVMe in ZFS mirror
Both have 2x 1gb-e and a PCI 2 port 10Gb-e card (Intel).
Management on 1Gb-e, all VMs on 10Gb-e

It starts copying, lasts around 5 minutes and then-

This crashes the old node, cluster, everything. Requires a manual poke to reboot.
Originally I thought that a ZFS error may be responsible (due to an earlier crash) but now not sure

I've also tried shutting down a VM and migrating it cold, same issue.
tried using gdisk to fix the partition error, same issue
tried setting up a Proxmox backup server, haven't got it working yet

Here are the results of my latest attempt at cold migration-

2022-05-06 23:18:10 23:18:10   29.0G   rpool/data/vm-101-disk-1@__migration__
2022-05-06 23:18:11 warning: cannot send 'rpool/data/vm-101-disk-1@__migration__': Input/output error
2022-05-06 23:18:12 cannot receive new filesystem stream: checksum mismatch
2022-05-06 23:18:12 cannot open 'rpool/data/vm-101-disk-3': dataset does not exist
2022-05-06 23:18:12 command 'zfs recv -F -- rpool/data/vm-101-disk-3' failed: exit code 1
send/receive failed, cleaning up snapshot(s)..
2022-05-06 23:18:12 ERROR: storage migration for 'local-zfs:vm-101-disk-1' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve02' root@172.20.1.20 -- pvesm import local-zfs:vm-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 1
2022-05-06 23:18:12 aborting phase 1 - cleanup resources
2022-05-06 23:18:13 ERROR: migration aborted (duration 00:04:36): storage migration for 'local-zfs:vm-101-disk-1' to storage 'local-zfs' failed - command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve02' root@172.20.1.20 -- pvesm import local-zfs:vm-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 1' failed: exit code 1
TASK ERROR: migration aborted

I'm going nuts- where do I start looking for clues please?

mira · May 6, 2022

Check the syslog and dmesg for any disk errors.
Does the pool show any errors on both sides? zpool status

If it crashes a host, then there's most likely something in the syslog and the dmesg output.

aconnor · May 7, 2022

thanks for the reply @mira
zpool status shows some errors

root@pve01:~# zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:08:23 with 8 errors on Sat May  7 21:04:25 2022
config:

        NAME                                                 STATE     READ WRITE CKSUM
        rpool                                                DEGRADED     0     0     0
          mirror-0                                           DEGRADED     0     0     0
            nvme-eui.0000000001000000e4d25cb59df55201-part3  DEGRADED     0     0    32  too many errors
            nvme-eui.0000000001000000e4d25cb39df55201-part3  DEGRADED     0     0    32  too many errors

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-101-disk-1:<0x1>
        rpool/data/vm-103-disk-0:<0x1>

I've scrubbed the pool, but these errors in the 2 vm disks remain.

Right now I don't massively care about vm-101, I can back up the config and rebuild pretty easily.
Using gdisk I was able to (temporarily?) fix vm-101, but it still wouldn't copy anywhere

However vm-103 is a cPanel server and my last backup is 6 days old.
It won't boot any longer because I tried the same trick and borked the boot partition.
Currently working on that - rescue boot for CentOS has no network, and needs it because the grub drivers aren't on the rescue disk...

But if I can boot vm-103 I can migrate the accounts to the new node where I've installed a new cPanel server...

key goals
1. any method to copy the VMs off that node
2. how to fix the boot partition on Centos cPanel (I'm halfway there)

mira · May 9, 2022

As those are `ZVols`, you should be able to copy them the same as any other data on a block device. One tool to do that is dd
Although there's no guarantee that the data is still intact.

Which NVMes are you using? (Vendor + Model)

aconnor · May 10, 2022

Hi @mira
the drives are
Vendor= Intel
Model= 665p

I am attaching pics of the drives, and also where the boot process stopped - please let me know if this error is familiar- or fixable

mira · May 10, 2022

There doesn't seem to be a firmware update available: https://www.intel.com/content/www/us/en/support/articles/000017245/memory-and-storage.html

If possible, replace the drives and see if checksum errors are gone for good.

aconnor · May 10, 2022

thanks @mira - the node did have significant bios firmware updates to do- from v34 to v69 or something. I didn't think about the storage.

Is there any way to figure out the boot issue / and or any way to poke around the filesystem to find out what happened?

I no longer care about the data, but I would feel happier if I could figure out if I made a mistake that I can avoid in future

mira · May 10, 2022

Usually both the journal (journalctl) and the dmesg contain any errors that are logged.

Your VM with the corrupted disk(rpool/data/vm-103-disk-0:<0x1>) is the one with boot issues?
Depending on the corruption, it might not be possible to fix it.

aconnor · May 10, 2022

Hi @mira yes- over the weekend I was able to repair the boot partition of vm 103, and it was working Monday.

Attempting to transfer the vm to a new node on Monday night and the node crashed.
I rebooted node by using a remote power switch- I guess that's probably what killed it, because it never booted again...

Thanks will see if I can find anything in the logs- all of the symptom of this issue were kind of disk related, but I'm worried that I missed something as there was nothing in SMART and 'zpool scrub' seemed ok

I spent a lot of time trying to figure this out, but there's not much on fixing/ managing/ recovering zfs mirror

mira · May 10, 2022

Yes, if all disks are affected, which is the case here, then it basically boils down to 'restore from backup'.

aconnor · May 10, 2022

ok thank you.

Just to close this off for future readers- the node would not boot with only one disk in it- I tried both individually.
If one disk was damaged we'd expect the good one to boot- I hope!

Search

Search

Can't migrate, backup, copy VMs

aconnor

New Member

mira

Proxmox Staff Member

aconnor

New Member

mira

Proxmox Staff Member

aconnor

New Member

Attachments

mira

Proxmox Staff Member

aconnor

New Member

mira

Proxmox Staff Member

aconnor

New Member

mira

Proxmox Staff Member

aconnor

New Member