Live migration to a host with 5.15 kernel (PVE 7.2) can kill all VMs on that node

udo

Hi,
The IO issue I described in https://forum.proxmox.com/threads/i...5-39-1-pve-bug-soft-lockup-inside-vms.113373/ is not fixed by the VM disk parameter aio=threads…
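For reference, a minimal sketch of switching an existing disk to aio=threads with qm set (VM ID 100, scsi0 and the storage/volume name are placeholders; the full drive string has to match your existing config, and the option only takes effect on the next VM start):
Code:
# show the current drive definition first
qm config 100 | grep scsi0
# re-set the drive with aio=threads appended (placeholder volume name)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,aio=threads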
Today I live-migrated a VM with two disks (25G + 75G) to a node, and afterwards many (all?) VMs on that host had issues. CPU usage went up and I got "CPU stuck for 94271s" messages in the VM console. That is more than 26 hours, which matches the log file entries with tomorrow's date:
Code:
more /var/log/ha-debug
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: info: log-rotate detected on logfile /var/log/ha-debug
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: info: log-rotate detected on logfile /var/log/ha-log
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 101217210 ms (> 510 ms) before being called (GSource: 0x5619a0d86cc0)
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: info: Gmain_timeout_dispatch: started at 1790452651 should have started at 1780330930
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: CRIT: Late heartbeat: Node vappdb04-prod: interval 101218210 ms (> deadtime)
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: WARN: node 10.XXX.XXX.1: is dead
Oct 07 13:38:40 vappdb04-prod heartbeat: [1409]: WARN: node vappdb03-prod: is dead
This shows that network access was broken as well (the gateway was not reachable).
The pveversion of the target host:
Code:
pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph-fuse: 15.2.17-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u1
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-2
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
Perhaps the migration speed of >650 MB/s is a problem, so the kernel cannot keep up with the guest's requests? But why a stuck CPU?
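If the migration speed is a suspect, a sketch of capping the bandwidth (values are in KiB/s; the VM ID and target node are placeholders, and whether a cap actually avoids the lockups is untested):
Code:
# cap a single online migration to roughly 200 MiB/s
qm migrate 100 targetnode --online --bwlimit 204800
# or set a cluster-wide default in /etc/pve/datacenter.cfg:
# bwlimit: migration=204800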
The host has an AMD CPU with the latest BIOS update.
After rebooting the host, the VMs start normally again.
Udo
 

Attachments: pve10_crash.jpg
Hi,
it's likely the issue fixed by this commit, which is included in kernel 5.19, but not 5.15. See here for how to install it.
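For reference, a minimal sketch of installing the opt-in newer kernel on PVE 7.x (assuming the pve-kernel-5.19 meta-package is available in the enabled Proxmox repository):
Code:
apt update
apt install pve-kernel-5.19
# reboot so the node starts on the new kernel
reboot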
 
