VM crashes with live migration

lhall

Today we're experiencing an issue with live migration between HVs where the VMs (Debian Linux, Buster or Stretch) crash several minutes after the move. The two HVs primarily affected (out of 8 in the cluster; not all tested so far) are running the package versions below and have both been rebooted today onto the latest available 5.4.98-1-pve kernel. The shared storage for these VMs is Ceph, and live migration has worked perfectly well for us in the past. Screenshots from a couple of VM crashes are attached; any help/advice anyone can offer would be greatly appreciated.

proxmox-ve: 6.3-1 (running kernel: 5.4.98-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-5
pve-kernel-helper: 6.3-5
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.13-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 8.3-1
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve1
 

Attachments

  • VM_Screenshot_2021-02-25_19-00-23.png (44.2 KB)
  • VM_Screenshot_2021-02-25_19-12-33.png (39.1 KB)
Hi!

FWIW, we have several such Debian-based VMs on hypervisors with Ceph that have been running fine with that upgrade + migration for several days now.

Could be correlated with the specific environment. What hardware is used for the hypervisor hosts?

Anything in the host syslog during those crashes?
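For example, something along these lines should capture the relevant window from the host journal (the timestamps and search terms below are only placeholders, adjust them to the actual crash time):

# On the source and target hypervisor, pull the journal around the crash time
journalctl --since "2021-02-25 18:55" --until "2021-02-25 19:15" > /tmp/crash-window.log

# Quick scan for anything QEMU/KVM related, OOM kills or segfaults
grep -iE 'qemu|kvm|oom|segfault' /tmp/crash-window.log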
 
Thanks Thomas. These are all Supermicro machines with Xeon E5-2620 CPUs and 128 or 256 GB RAM. We have no reason to suspect that these 8 machines have all suddenly developed a hardware fault; the live-migration problem appears to exist across all of them.

My suspicion is that the problem has been introduced somewhere between these versions:

proxmox-ve: 6.3-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
qemu-server: 6.3-3
pve-qemu-kvm: 5.1.0-8
ceph-fuse: 12.2.13-pve1


proxmox-ve: 6.3-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-5
qemu-server: 6.3-5
pve-qemu-kvm: 5.2.0-2
ceph-fuse: 12.2.13-pve1

pve-qemu-kvm is my chief suspect. These version changes represent the last batch of updates we applied on the HVs. Last night I moved some VMs from an HV which had the updates applied and had been rebooted onto another HV in exactly the same state, and after some minutes (10-30) most of those VMs crashed with the kernel panic shown above. I need to do some further testing, but is it possible that VMs started under pve-qemu-kvm 5.1.0-8, on a host which was then upgraded, could still be expecting to 'see' that version on the host?
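For what it's worth, one way I plan to check whether a guest is still running on the pre-upgrade binary is something like the following (the PID file path and VMID 100 are assumptions for illustration):

# If the VM was started before the upgrade, its exe link points at a
# "(deleted)" binary; VMID 100 is just an example
VMID=100
PID=$(cat /var/run/qemu-server/${VMID}.pid)
ls -l /proc/${PID}/exe

# Compare against the currently installed package version
dpkg -s pve-qemu-kvm | grep Version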

BTW, we also experienced this

https://forum.proxmox.com/threads/warning-latest-patch-just-broke-all-my-windows-vms-6-3-4.84915/

with one odd Windows VM. My experience of Proxmox updates over many years has generally been problem-free, but I have to say I'm a bit skeptical of this last round of patches.

Thanks.
 
So the HV which always seems to be involved in these VM crashes has an EPYC 7452 32-core CPU, whereas the others all have Xeons (E5-2620) or similar. The EPYC machine was memtested for over 100 hours this weekend and no errors were found.

The post below potentially highlights the problem, though when we commissioned this HV back at the start of last year we had no such issues moving VMs (30+) on or off it.

https://forum.proxmox.com/threads/l...-intel-xeon-and-amd-epyc2-linux-guests.68663/

Is it possible/known that pve-qemu-kvm 5.2.0-2 or qemu-server 6.3-5 has reintroduced an issue that was previously fixed? Does anyone have any ideas for testing? I will try the suggestions in that thread in the meantime, but I'm not keen on running with a bespoke CPU type unless absolutely necessary; it just wasn't required 6 months ago.
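Before changing anything, I'll at least check which of the affected guests explicitly set a CPU type, roughly like this (the config path is the standard /etc/pve location; adjust if yours differs, and VMID 100 is just an example):

# List the CPU type configured for each guest; VMs with no 'cpu:' line use
# the default (kvm64)
grep -H '^cpu:' /etc/pve/qemu-server/*.conf

# Or for a single VM
qm config 100 | grep '^cpu'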
 
Could you check the CPU flags in /proc/cpuinfo on the source and target machines?

I have seen them differ on two machines with identical hardware but different Proxmox/kernel versions, i.e. I have a cluster with identical hardware where I got crashes on live migration (during a cluster upgrade) because of that (because the VMs' CPU type was set to 'host').
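Something along these lines should show any difference (pve-node1/pve-node2 are placeholders for your source and target hosts):

# Dump the sorted flag list from each node and diff them
ssh pve-node1 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/flags-src
ssh pve-node2 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/flags-dst
diff /tmp/flags-src /tmp/flags-dst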

I fixed that by setting the VM CPU to a specific CPU model (Westmere).
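If it comes to that, pinning the CPU type is a one-liner (VMID 100 is an example, and I'm assuming the built-in Westmere model is what you want):

# Pin the guest to a baseline model instead of cpu=host
# (takes effect on the next full VM restart)
qm set 100 --cpu Westmere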
 