VMs occasionally hang, is the following cron job to reset VMs a good idea?

surfrock66

Well-Known Member
Feb 10, 2020
I am having an issue with a 3-node cluster where one hypervisor is substantially different from the others (a totally different CPU generation). Eventually I will replace them all, but that takes time. Until then, I have the CPU type set to the lowest common compatible version, and for the most part it works. What I do find, though, is that if I do maintenance and migrate a VM away and then back, at some point in the next 48 hours the VM pegs its CPU at ~15-25% and then hangs.

I can detect this with "qm guest cmd ### info" and issue a "qm reset ###", but I want this automated. I find that if I make the following a cron job running every 5 minutes, I can essentially find and fix it automatically:

*/5 * * * * for vm in $(/usr/sbin/qm list | awk '{print $1}' | grep -Eo '[0-9]{1,3}'); do if [ $(/usr/sbin/qm guest cmd $vm info 2>&1 | grep -e "not running" | wc -l) -eq 1 ]; then /usr/sbin/qm reset $vm; fi; done

In theory this works, and if I were doing maintenance where I knew a guest would be offline, it'd be easy enough to disable the cron job. My question is, is there a better way to do this? Is there some big obvious reason I'm not thinking of that makes this a stupid idea?
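For reference, here is the same logic as a small script instead of a one-liner, with two tweaks: it only looks at VMs that "qm list" itself reports as running (so intentionally stopped guests are skipped), and it doesn't assume VMIDs have at most three digits. This is just a sketch; it assumes "qm list" prints the status in the third column and that an unresponsive agent makes "qm guest cmd <vmid> info" print an error containing "not running".
Code:
#!/bin/bash
# Reset any running VM whose QEMU guest agent no longer responds.
# Assumptions: "qm list" prints the VM status in column 3, and an
# unresponsive agent makes "qm guest cmd <vmid> info" fail with an
# error containing "not running".
for vm in $(/usr/sbin/qm list | awk 'NR > 1 && $3 == "running" {print $1}'); do
    if /usr/sbin/qm guest cmd "$vm" info 2>&1 | grep -q "not running"; then
        logger -t vm-watchdog "guest agent on VM $vm unresponsive, resetting"
        /usr/sbin/qm reset "$vm"
    fi
done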
 
Hello surfrock66! I would personally recommend trying to fix the issue (if possible) instead of relying on workarounds. However, at this point we would need more information in order to help:
  1. What is the hardware configuration of each of the servers in the cluster?
  2. What is the configuration of the VM in question?
  3. Is this a Linux VM? If yes, please try running top or htop to find out what uses the CPU.
  4. If this is a Linux VM, it also makes sense to look at the journal - run journalctl --since <TIME> in the VM, preferably with a time at least 30 minutes before the issues begin to happen, but also around both migrations (you can also use --until for that purpose); see the example below.
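A hypothetical invocation (the timestamps are placeholders; pick them so they bracket the migrations and the hang):
Code:
journalctl --since "2025-02-05 08:00" --until "2025-02-06 12:00"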
 
I'd been down this troubleshooting path before, but the answer was essentially "expect unpredictable stability on clusters with vastly different hardware."

1) Two are Dell PowerEdge R720s with 2x Xeon X5687 each; one is a Dell PowerEdge R6525 with an EPYC 7252. As budget allows, the latter is to be the standard for replacing the others.
2) All my VMs are configured the same way, using CPU type "x86-64-v2-AES", which was, at the time, the highest common architecture level I could use across those hypervisor CPUs. The VMs have various cores/sockets allocated, and all have NUMA enabled.
3) They're all Linux, but when in that state I can't log in, run processes, etc.; they're fully hung. When I reset them, there's nothing anomalous.
4) The logs simply stop at the hang; I get nothing specific. It's just normal operation -> stop -> boot.
 
Alright, then:
  1. Could you please post the configuration of the VMs in question?
  2. Does the issue happen with all VMs, or just a specific one?
  3. Please post the output of pveversion -v from all servers.
 
1) A sample config for a VM on the host in question (my Collabora server):
Code:
root@sr66-prox-03:~# qm config 116
agent: 1,fstrim_cloned_disks=1
boot: order=ide2;scsi0
cores: 4
cpu: x86-64-v2-AES
hotplug: disk,network,usb,memory,cpu
ide2: none,media=cdrom
memory: 4096
meta: creation-qemu=8.0.2,ctime=1696607428
name: sr66-cool-01
net0: virtio=BA:E6:A8:75:ED:D1,bridge=vmbr0,firewall=1,tag=2
numa: 1
onboot: 1
ostype: l26
scsi0: sr66-nas-2-lun01-10G-thin-hispeed:vm-116-disk-0,iothread=1,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=a61c60cb-565b-4934-8e2f-699662990648
sockets: 2
startup: order=7,up=0
vcpus: 8
vmgenid: dc668049-0bf8-48da-83c7-7fd3b74db2e0
2) This affects all VMs on that specific host.
3a) Host 1
Code:
root@sr66-prox-01:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.3 (running version: 8.3.3/f157a38b211595d6)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5.13-3-pve-signed: 6.5.13-3
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2.16-14-pve: 6.2.16-14
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.3.3
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1
3b) Host 2
Code:
root@sr66-prox-02:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.3 (running version: 8.3.3/f157a38b211595d6)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5.13-3-pve-signed: 6.5.13-3
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2.16-14-pve: 6.2.16-14
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.3.3
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1
3c) Host 3 (the trouble host)
Code:
root@sr66-prox-03:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.3 (running version: 8.3.3/f157a38b211595d6)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.3.3
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1
 
Thanks for the information! As you probably know already, the Proxmox VE documentation states the following:
Code:
Live migrations between Intel and AMD host CPUs have no guarantee to work.

However, I'm not sure whether that's actually the issue here, since the migration itself seems to succeed and the VMs run fine for quite a while afterwards.

I was also wondering whether this might be a storage issue, where the VM gets stuck because it no longer has access to its storage. However, in that case I would expect it to restart rather than get stuck with high CPU usage.

While I'm still trying to figure out what could help, I would like to know the following to get a better overview of the situation:
  1. What storage are you using for the VMs? Do you have them on Ceph?
  2. Do you see anything unusual in the journal of the host? While I don't expect you to find anything interesting, it might be worth checking as well.

Some things you can try out, but at this point I'm not sure whether they will help:
  • You can try out the opt-in kernel 6.11 in the VMs to see if it improves the situation.
  • Try setting the CPU type to kvm64 (see the example below).
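For a single test VM, changing the CPU type from the CLI would look roughly like this (a sketch; VMID 116 is just the example from the config posted above, and the VM has to be restarted so QEMU is launched with the new CPU model):
Code:
qm set 116 --cpu kvm64            # change the virtual CPU model in the VM config
qm shutdown 116 && qm start 116   # cold restart so the new model takes effect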
Also, your version of Ceph is EOL - see this thread. Consider updating to either 18.2 or 19.2.
 
No Ceph in the environment; these are actually iSCSI LUNs from a TrueNAS SCALE box. They're pretty rock solid, and I never had an issue when I was running a cluster of 3 hypervisors with identical CPUs. The other thing that's weird is that it doesn't all happen at once; over the 24 hours or so after a migration, random VMs will hang at random times.

I've not found anything in the host logs during a hang, and it's tricky to reproduce on demand. That's why my approach has been the cron job to react rather than prevent, because it's been so unpredictable and inconclusive. I've kind of resigned myself to the fact that this is a quirk of the CPU type with mixed CPU vendors, and until I can justify the budget to resolve that (roughly 5k/host) I think the cron job will mostly keep the headaches in check.
 
I think, just like you, that the most probable source of the issues is the fact that live migrations between different CPU vendors are not guaranteed to work. In your case, it almost works. However, keep in mind that this applies specifically to live migrations.

Depending on what you're trying to achieve, you might instead prefer to shut down the VM entirely, do an offline migration, then start it again. This way, you at least know that the issues you are describing won't happen at some random point in time.
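As a rough illustration (a sketch only; VMID 116 and the node name are taken from earlier in this thread, and you can just as well do this from the web UI):
Code:
qm shutdown 116                   # cleanly stop the guest first
qm migrate 116 sr66-prox-01       # offline migration while the VM is powered off
ssh sr66-prox-01 qm start 116     # start it again on the target node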