2 VMs vapourized during migration, config missing, ghostbuster required.

brainsoft

New Member
Jun 25, 2024
24
0
1
Hi Everyone,
I offlined and migrated a bunch of small guests to another node (NON-HA). I was doing one at a time and it was going okay, and I got impatient and tried to do 4 at the same time. They all failed migration, and when i tried again one at time, 2 more went (i thought) and 2 failed. I tried again on what I thought were the last 2, but to a different node, and they migrated succefully. Great, the trouble node is now offloaded and ready for testing/monitoring or full wipe.

The two vms I thought had moved okay actually just disappeared. The receiving host lists their VMID in the gui with the little migration paper airplane, but "no config" when you try to access them. Config isn't on either host, but are locked on the gui only on the receiving host. I don't care about the guests and am fine if I can just kill the gui ghosts (and find and kill any orphans. I can pull them back from PBS in short order, but I don't want to restore them until I have killed the ghosts so I can reuse their vmids.

I tried a couple of things but I'm not sure where the gui is pulling that info from, since the vmids config files don't actually exist.

Code:
root@regulus:~# qm unlock 801
Configuration file 'nodes/regulus/qemu-server/801.conf' does not exist

Code:
root@regulus:~# qm destroy 801
Configuration file 'nodes/regulus/qemu-server/801.conf' does not exist

Everything is PVE 8.4.1 and up-to-date.
 
please provide the contents of the corresponding migration logs, "pveversion -v" from all nodes, and "find /etc/pve"
 
Hi Fabian, thanks for your help, please see below:
My uninformed impression was of a migration traffic jam. Not sure how much data is actually transferred when migrating a 32gb vm image that's mostly empty space, but I effectively attempted to simultaneously transfer 4 or 5 of them all at the same time to a host with only 64gb of free space, over a gigabit connection.

Also, possibly unrelated, the tiberius node, the primary one that's had some gremlins lately after a couple hard shutdowns following a UPS failure has also been demonstrating extended periods of 100% cpu utilization, with the big ticket items on top being node, and occasionally kvm as well. These vms are not really "doing" anything, wasn't able to sort it out yet.



Primary node where the guests transferred away from.
Code:
root@tiberius:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-10
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.7-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

From the node where the 2 ghosts live (vmid 801 and 999)
Code:
root@regulus:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-9-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.7-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.0-1
proxmox-backup-file-restore: 3.4.0-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.3
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

From the 3rd node, where the two that failed actually succeeded
Code:
root@marcus:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-9-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.7-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.0-1
proxmox-backup-file-restore: 3.4.0-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.3
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

And finally
Code:
root@marcus:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-9-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.7-pve3
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.0-1
proxmox-backup-file-restore: 3.4.0-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.3
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2
 
Can you check the actual file structure (find /etc/pve)? Also, the migration logs (the job should be in the GUI).

I’m presuming they are all in the same cluster and the nodes are healthy. What is your storage fabric, and are the VM images still there?
 
Last edited:
Well today I guess the troubles compounded enough that pve gui shat the bed. Couldn't restart the services, swap was pulling 100% cpu. Restarted the node, conf files regenerated somehow, so I was able to qm unlock the two items. Went through and kills all the offline migrated vms and just thinned it all out and we'll see how she goes over the weekend.

Hopefully there are no orphan images laying around, but I feel pretty good about actually being able to delete the two locked ghosts through the gui.

This is the second node since this fiasco started to just keep getting pegged at 100% cpu. rg is at the top of the list and just pegged there after a minute or two of starting. Of course it has the firewall on it and has been running flawlessly until i migrated those vms to it.

There were a couple of package upgrades so just did a full upgrade and restart on each node and the cpue load issue seems to have gone away for now. I nuked every guest and will restore one by one and monitor. Sure am glad this is not a production environment or I'd be having a crappy morning.
 
Everything appears to have stabilized. I basically nuked every guest and restored from backups from before the ups failed. It feels like maybe there was a corrupt image or lxc that had a memory leak because the ram usage also kept climbing until all cpu was used for swapping or something.

Will keep monitoring... Fingers crossed!