Live migration to a host with 5.15 kernel (PVE 7.2) can kill all VMs on that node

udo

Hi,
The IO issue I described in https://forum.proxmox.com/threads/i...5-39-1-pve-bug-soft-lockup-inside-vms.113373/ is not fixed by the VM disk parameter aio=threads…
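For reference, a minimal sketch of switching an existing disk to aio=threads with qm set (VM ID 100, scsi0 and the storage/volume name are placeholders; the full drive string has to match your existing config, and the option only takes effect on the next VM start):
Code:
# show the current drive definition first
qm config 100 | grep scsi0
# re-set the drive with aio=threads appended (placeholder volume name)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,aio=threads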
Today I live-migrated a VM with two disks (25G + 75G) to a node, and afterwards many (all?) VMs on that host had issues. CPU usage went up and I got "CPU stuck for 94271s" messages in the VM console. That is more than 26 hours, which matches the log file entries with tomorrow's date:
Code:
more /var/log/ha-debug
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: info: log-rotate detected on logfile /var/log/ha-debug
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: info: log-rotate detected on logfile /var/log/ha-log
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 101217210 ms (> 510 ms) before being called (GSource: 0x5619a0d86cc0)
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: info: Gmain_timeout_dispatch: started at 1790452651 should have started at 1780330930
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: CRIT: Late heartbeat: Node vappdb04-prod: interval 101218210 ms (> deadtime)
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: WARN: node 10.XXX.XXX.1: is dead
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: WARN: node vappdb03-prod: is dead
This shows that network access was broken as well (the gateway was not reachable).
The pveversion of the target host:
Code:
pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph-fuse: 15.2.17-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-2
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
Perhaps the migration speed of >650 MB/s is a problem, so the kernel cannot keep up with the guest's requests? But why a stuck CPU?
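If the migration speed is a suspect, a sketch of capping the bandwidth (values are in KiB/s; the VM ID and target node are placeholders, and whether a cap actually avoids the lockups is untested):
Code:
# cap a single online migration to roughly 200 MiB/s
qm migrate 100 targetnode --online --bwlimit 204800
# or set a cluster-wide default in /etc/pve/datacenter.cfg:
# bwlimit: migration=204800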
The host has an AMD CPU with the latest BIOS update.
After rebooting the host, the VMs start normally again.
Udo
 

Attachments: pve10_crash.jpg
Hi,
it's likely the issue fixed by this commit, which is included in kernel 5.19, but not 5.15. See here for how to install it.
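For reference, a minimal sketch of installing the opt-in newer kernel on PVE 7.x (assuming the pve-kernel-5.19 meta-package is available in the enabled Proxmox repository):
Code:
apt update
apt install pve-kernel-5.19
# reboot so the node starts on the new kernel
reboot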
 
