Locking VMs

coenvl

Dear all,

I have a cluster of 3 Proxmox nodes, currently running 45 VMs in a research setting. I have been operating this cluster for about 6 years now, starting out with Proxmox 4, and have upgraded to each new release according to the upgrade protocol without too much trouble. However, that changed this year in February when I upgraded from Proxmox 6.4 to 7.1. Ever since, I have been experiencing issues with the stability of the VMs. First off, I was unable to live migrate some of the VMs back after the upgrade because of the issue that was also mentioned here: https://forum.proxmox.com/threads/116315. This was unfortunate because I had already started the migration, so I had to force kill some of the VMs. According to that thread, this might be resolved by upgrading to the opt-in kernel 5.19, which was indeed the case. However, after that, some VMs would mysteriously get into a "stuck" state, and I have been unable to find out why, which is why I am posting this question here.

It happens relatively rarely, but often enough to become very annoying. I have been keeping track, and it has occurred 17 times between February 9 and May 22 (today), which is on average a little more than once a week. I am unable to find any pattern in which VMs get stuck and which do not, except that the more actively used VMs seem to be affected more often.
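For reference, the rate quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the post does not state the year, so 2023 is used here purely as an assumed non-leap year, and the variable names are illustrative.

```python
from datetime import date

# Assumption: a non-leap year. The year is not given in the post;
# 2023 is used only so the dates can be constructed.
first = date(2023, 2, 9)    # start of the tracking period
last = date(2023, 5, 22)    # day of posting
incidents = 17              # number of stuck-VM events counted

days = (last - first).days          # length of the period in days
weeks = days / 7
per_week = incidents / weeks        # average incidents per week
days_between = days / incidents     # average gap between incidents

print(f"{days} days, {weeks:.1f} weeks")
print(f"{per_week:.2f} incidents per week, one every {days_between:.1f} days")
```

Under that assumption the period is 102 days, so the events are on average about six days apart.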

At first I used to force reset the stuck VMs, but I found out that live migrating the VMs to another node also seems to resolve the stuck state. I am unable to find any logging info, but I am also not sure where to look. On the guests, I find lots of timeout warnings after the VM is released, which makes sense, but nothing that I think points to a probable cause. I cannot find anything on the host machine, but please point me in the right direction if you think there could be something useful.

In an attempt to fix this, I have already completely reinstalled the Proxmox host OS on all three nodes. I have enough resources to completely drain a single node at a time, so I did exactly that to do a clean install of Proxmox, including the opt-in kernel, which is now at version 6.2. I also recently installed up-to-date BIOS firmware and the intel-microcode package, but that seems to have had no effect.

Finally, some more info on the hardware and software used.

Hardware
host 1: 2x Intel(R) Xeon(R) CPU E5-2640, 384 GB DDR4 RAM
host 2: 2x Intel(R) Xeon(R) CPU E5-2630, 368 GB DDR4 RAM
host 3: 2x Intel(R) Xeon(R) Silver 4114T, 384 GB DDR4 RAM
The host OS is installed on a ZFS mirror of SSDs, and the VMs are stored on a Ceph cluster whose backend is connected over a 10 Gbit/s SFP+ fibre network. The Ceph cluster is running Quincy 17.6, and I have no reason to suspect any problems there. Of course, if you think it is relevant, I can provide more details.

Software
proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-6.2.9-1-pve: 6.2.9-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
I don't have much to add to your report about your environment and am only replying because I have seen VMs get locked as well. In my case, though, they are locked in the "backup" state with no backup currently being performed. PBS is involved, so I was not sure which one is the cause.

Apologies if this is not posted in the correct forum; I am not trying to hijack the thread, just posting in response to the "Locking VMs" title.
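A "backup" lock of this kind is, as far as I know, recorded as a `lock:` entry in the VM's configuration (`/etc/pve/qemu-server/<vmid>.conf`, also shown by `qm config <vmid>`) and can be cleared with `qm unlock <vmid>` once it is certain that no backup is running. Below is a minimal Python sketch for spotting such an entry in config text. The function name `find_lock` and the sample config are hypothetical, and this only detects a config lock, not a hung guest like the "stuck" state in the original post.

```python
from typing import Optional

def find_lock(config_text: str) -> Optional[str]:
    """Return the value of the top-level `lock:` entry, or None if absent."""
    for line in config_text.splitlines():
        stripped = line.strip()
        # Assumption: snapshot sections start with a [name] header, and only
        # the current configuration above the first header is of interest.
        if stripped.startswith("["):
            break
        if stripped.startswith("lock:"):
            return stripped.split(":", 1)[1].strip()
    return None

# Hypothetical config contents, for illustration only.
sample = """\
cores: 4
lock: backup
memory: 8192
name: example-vm
"""
print(find_lock(sample))
```

On a node, the text could come from reading the config file or from the output of `qm config <vmid>`; the sketch deliberately leaves that part out.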
 
