Help! Proxmox live migration destroys config file and leaves VM in a zombie state

exp

This is pretty insane. I was migrating VM 401 from PVE1 to PVE2. In the end, I got this error:

Code:
[...]
2024-09-30 23:48:16 migration active, transferred 4.9 GiB of 2.0 GiB VM-state, 16.6 MiB/s
2024-09-30 23:48:16 xbzrle: send updates to 384647 pages in 156.9 MiB encoded memory, cache-miss 65.34%, overflow 22795
2024-09-30 23:48:18 migration active, transferred 4.9 GiB of 2.0 GiB VM-state, 19.1 MiB/s, VM dirties lots of memory: 21.8 MiB/s
2024-09-30 23:48:18 xbzrle: send updates to 386878 pages in 157.3 MiB encoded memory, cache-miss 63.62%, overflow 22822
2024-09-30 23:48:18 auto-increased downtime to continue migration: 12800 ms
2024-09-30 23:48:29 average migration speed: 3.6 MiB/s - downtime 9352 ms
2024-09-30 23:48:29 migration status: completed
all 'mirror' jobs are ready
drive-efidisk0: Completing block job...
drive-efidisk0: Completed successfully.
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi1: Completing block job...
drive-scsi1: Completed successfully.
drive-efidisk0: mirror-job finished
drive-scsi0: mirror-job finished
drive-scsi1: mirror-job finished
2024-09-30 23:48:31 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' -o 'UserKnownHostsFile=/etc/pve/nodes/pve2/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.227.1.21 pvesr set-state 401 \''{"local/pve1":{"last_sync":1727764103,"last_node":"pve1","last_iteration":1727764103,"duration":580.699161,"storeid_list":["local-zfs"],"last_try":1727764103,"fail_count":0}}'\'
2024-09-30 23:48:32 stopping NBD storage migration server on target.
2024-09-30 23:48:41 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' -o 'UserKnownHostsFile=/etc/pve/nodes/pve2/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.227.1.21 qm unlock 401
2024-09-30 23:48:41 ERROR: failed to clear migrate lock: Configuration file 'nodes/pve2/qemu-server/401.conf' does not exist
2024-09-30 23:48:41 ERROR: migration finished with problems (duration 00:20:20)
TASK ERROR: migration problems


The VM is still reachable via ping, and the kvm process is running on PVE2. But it doesn't show up in Proxmox on either PVE, and not in "qm list" either.

The source PVE1 looks like this:

[screenshot of the source PVE1 web UI]


And on both PVEs (!) /etc/pve/qemu-server/401.conf does not exist any more. It's just gone.
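
For context, /etc/pve/qemu-server on each node is just a symlink to /etc/pve/nodes/<that node>/qemu-server, so the cluster-wide state can be checked roughly like this (the 'kvm ... -id 401' match is an assumption about how PVE starts its guests, so treat it as a sketch):

Code:
# run on each node; /etc/pve is cluster-wide, but pgrep only sees local processes
ls -l /etc/pve/nodes/*/qemu-server/401.conf
pgrep -af 'kvm.*-id 401'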

  1. Is 401.conf still available somewhere (other than in the backup, which I should have)?
  2. How do I bring this back into a consistent state? What happens to changes made inside the VM in the meantime? Is my VM now "officially" running on PVE2 or not?
  3. Most importantly: how on earth can something like this happen?



EDIT: It just gets crazier and crazier. I am trying to regenerate 401.conf from the parameters I can find in the process list (since the VM is still running). On the one hand, the config file does not exist; on the other hand, it does. Now what? WTF?

Code:
root@pve1:/etc/pve/qemu-server# cat 401.conf
cat: 401.conf: No such file or directory
root@pve1:/etc/pve/qemu-server# cp /tmp/401.conf .
cp: cannot create regular file './401.conf': File exists
root@pve1:/etc/pve/qemu-server#
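
In case anyone needs to do the same: the full argument list of the still-running guest can be dumped from /proc and used as a starting point for rebuilding 401.conf by hand. A rough sketch, assuming the process can be found via its '-id 401' argument; the output is a draft to work from, not a ready-to-use config:

Code:
# on the node where the kvm process is still running (pve2 here)
PID=$(pgrep -f -- '-id 401' | head -n1)
tr '\0' '\n' < /proc/"$PID"/cmdline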
 
OK, after a lot of sweat and rebooting the PVEs and VMs, I was able to recover.
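
For anyone in the same spot: the gist of such a recovery is to put the reconstructed config where the running kvm process actually lives and to clear the leftover migrate lock. This is a rough sketch only, not a literal transcript of my steps, and it assumes pmxcfs behaves normally again (which for me took the reboots):

Code:
# assumes the hand-rebuilt config was saved as /tmp/401.conf (see the edit above)
cp /tmp/401.conf /etc/pve/nodes/pve2/qemu-server/401.conf
# on pve2: clear the migrate lock if it is still set, then check the VM is listed again
qm unlock 401
qm list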

I did a post-mortem, and it seems there was an out-of-space condition on the target ZFS pool during a prior migration attempt (see the log below).

But I am still in shock that this can even happen. Shouldn't an error condition like this be caught and rolled back cleanly? This experience leaves me quite worried about the future...


Code:
2024-09-30 23:23:14 23:23:14    392M   rpool/data/vm-401-disk-1@__replicate_401-0_1727763712__
2024-09-30 23:23:14 cannot receive incremental stream: out of space
2024-09-30 23:23:15 23:23:15    392M   rpool/data/vm-401-disk-1@__replicate_401-0_1727763712__
2024-09-30 23:23:16 command 'zfs recv -F -- rpool/data/vm-401-disk-1' failed: exit code 1
2024-09-30 23:23:16 command 'zfs send -Rpv -I __replicate_401-0_1727596823__ -- rpool/data/vm-401-disk-1@__replicate_401-0_1727763712__' failed: got signal 13
2024-09-30 23:23:16 delete previous replication snapshot '__replicate_401-0_1727763712__' on local-zfs:vm-401-disk-0
2024-09-30 23:23:16 delete previous replication snapshot '__replicate_401-0_1727763712__' on local-zfs:vm-401-disk-1
2024-09-30 23:23:17 delete previous replication snapshot '__replicate_401-0_1727763712__' on local-zfs:vm-401-disk-2
2024-09-30 23:23:17 delete previous replication snapshot '__replicate_401-0_1727763712__' on local-zfs:vm-401-state-upgrade2_2024_05_05
2024-09-30 23:23:17 delete previous replication snapshot '__replicate_401-0_1727763712__' on local-zfs:vm-401-state-upgrade2_2024_05_05b
2024-09-30 23:23:17 end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-401-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_401-0_1727763712__ -base __replicate_401-0_1727596823__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' -o 'UserKnownHostsFile=/etc/pve/nodes/pve2/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.227.1.21 -- pvesm import local-zfs:vm-401-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_401-0_1727763712__ -allow-rename 0 -base __replicate_401-0_1727596823__' failed: exit code 255
2024-09-30 23:23:17 ERROR: command 'set -o pipefail && pvesm export local-zfs:vm-401-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_401-0_1727763712__ -base __replicate_401-0_1727596823__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' -o 'UserKnownHostsFile=/etc/pve/nodes/pve2/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.227.1.21 -- pvesm import local-zfs:vm-401-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_401-0_1727763712__ -allow-rename 0 -base __replicate_401-0_1727596823__' failed: exit code 255
2024-09-30 23:23:17 aborting phase 1 - cleanup resources
2024-09-30 23:23:17 scsi1: removing block-dirty-bitmap 'repl_scsi1'
2024-09-30 23:23:17 efidisk0: removing block-dirty-bitmap 'repl_efidisk0'
2024-09-30 23:23:17 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2024-09-30 23:23:17 ERROR: migration aborted (duration 00:01:28): command 'set -o pipefail && pvesm export local-zfs:vm-401-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_401-0_1727763712__ -base __replicate_401-0_1727596823__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' -o 'UserKnownHostsFile=/etc/pve/nodes/pve2/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.227.1.21 -- pvesm import local-zfs:vm-401-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_401-0_1727763712__ -allow-rename 0 -base __replicate_401-0_1727596823__' failed: exit code 255
TASK ERROR: migration aborted
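
For reference, the "out of space" comes from the zfs recv on the target side, so the free space on the receiving pool can be checked up front with plain ZFS tooling (pool name rpool taken from the log above):

Code:
# on the receiving node (pve2)
zpool list rpool
zfs list -o space rpool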
 
