VM crashes with live migration

lhall

Today we're experiencing an issue with live migration between HVs where the VMs (Debian Linux, Buster or Stretch) crash several minutes after the move. The two HVs primarily affected (out of 8 in the cluster; not all tested so far) are running the package versions below and have both been rebooted today onto the latest available 5.4.98-1-pve kernel. The shared storage for these VMs is Ceph, and live migration has worked perfectly well for us in the past. Screenshots from a couple of VM crashes are attached; any help/advice anyone can offer would be greatly appreciated.

proxmox-ve: 6.3-1 (running kernel: 5.4.98-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-5
pve-kernel-helper: 6.3-5
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.13-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 8.3-1
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve1
 

Attachments

  • VM_Screenshot_2021-02-25_19-00-23.png (44.2 KB)
  • VM_Screenshot_2021-02-25_19-12-33.png (39.1 KB)
Hi!

FWIW, we have several such Debian-based VMs on hypervisors with Ceph that have been running fine with that upgrade + migration for several days now.

Could be correlated with the specific environment. What hardware is used for the hypervisor hosts?

Anything in the host syslog during those crashes?
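For example, something along these lines should capture the relevant window from the host journal (the timestamps and search terms below are only placeholders, adjust them to the actual crash time):

# On the source and target hypervisor, pull the journal around the crash time
journalctl --since "2021-02-25 18:55" --until "2021-02-25 19:15" > /tmp/crash-window.log

# Quick scan for anything QEMU/KVM related, OOM kills or segfaults
grep -iE 'qemu|kvm|oom|segfault' /tmp/crash-window.log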
 
Thanks Thomas. These are all Supermicro machines with Xeon E5-2620 CPUs and 128 or 256 GB RAM. We have no reason to suspect that these 8 machines have all suddenly developed a hardware fault; the live-migration problem appears to exist across all of them.

My suspicion is that the problem has been introduced somewhere between these versions:

proxmox-ve: 6.3-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
qemu-server: 6.3-3
pve-qemu-kvm: 5.1.0-8
ceph-fuse: 12.2.13-pve1


proxmox-ve: 6.3-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-5
qemu-server: 6.3-5
pve-qemu-kvm: 5.2.0-2
ceph-fuse: 12.2.13-pve1

pve-qemu-kvm is my chief suspect. These version changes represent the last batch of updates we applied on the HVs. Last night I moved some VMs from an HV which had the updates applied and had been rebooted onto another HV in exactly the same state, and after some minutes (10-30) most of those VMs crashed with the kernel panic shown above. I need to do some further testing, but is it possible that VMs started under pve-qemu-kvm 5.1.0-8, on a host which was then upgraded, could still be expecting to 'see' that version on the host?
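For what it's worth, one way I plan to check whether a guest is still running on the pre-upgrade binary is something like the following (the PID file path and VMID 100 are assumptions for illustration):

# If the VM was started before the upgrade, its exe link points at a
# "(deleted)" binary; VMID 100 is just an example
VMID=100
PID=$(cat /var/run/qemu-server/${VMID}.pid)
ls -l /proc/${PID}/exe

# Compare against the currently installed package version
dpkg -s pve-qemu-kvm | grep Version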

BTW, we also experienced this

https://forum.proxmox.com/threads/warning-latest-patch-just-broke-all-my-windows-vms-6-3-4.84915/

with one odd Windows VM. My experience of Proxmox updates over many years has generally been problem-free, but I have to say I'm a bit skeptical of this last round of patches.

Thanks.
 
So the HV which always seems to be involved in these VM crashes has an EPYC 7452 32-core CPU, whereas the others all have Xeons (E5-2620) or similar. The EPYC machine was memtested for over 100 hours this weekend and no errors were found.

The post below potentially highlights the problem, though when we commissioned this HV back at the start of last year we had no such issues moving VMs (30+) on or off it.

https://forum.proxmox.com/threads/l...-intel-xeon-and-amd-epyc2-linux-guests.68663/

Is it possible/known that pve-qemu-kvm 5.2.0-2 or qemu-server 6.3-5 has reintroduced an issue that was previously fixed? Does anyone have any ideas for testing? I will try the suggestions in that thread in the meantime, but I'm not keen on running with a bespoke CPU type unless absolutely necessary; it just wasn't required 6 months ago.
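Before changing anything, I'll at least check which of the affected guests explicitly set a CPU type, roughly like this (the config path is the standard /etc/pve location; adjust if yours differs, and VMID 100 is just an example):

# List the CPU type configured for each guest; VMs with no 'cpu:' line use
# the default (kvm64)
grep -H '^cpu:' /etc/pve/qemu-server/*.conf

# Or for a single VM
qm config 100 | grep '^cpu'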
 
Could you check the CPU flags in /proc/cpuinfo on the source and target machines?

I have seen them differ on two machines with identical hardware but different Proxmox/kernel versions, i.e. I have a cluster with identical hardware where I got crashes on live migration (during a cluster upgrade) because of that (because the VMs' CPU type was set to 'host').
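Something along these lines should show any difference (pve-node1/pve-node2 are placeholders for your source and target hosts):

# Dump the sorted flag list from each node and diff them
ssh pve-node1 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/flags-src
ssh pve-node2 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/flags-dst
diff /tmp/flags-src /tmp/flags-dst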

I fixed that by setting the VM CPU to a specific CPU model (Westmere).
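If it comes to that, pinning the CPU type is a one-liner (VMID 100 is an example, and I'm assuming the built-in Westmere model is what you want):

# Pin the guest to a baseline model instead of cpu=host
# (takes effect on the next full VM restart)
qm set 100 --cpu Westmere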
 