VM stuck randomly

hac3ru

Member
Mar 6, 2021
Hello,

I am running a 3-node Proxmox VE cluster with shared storage (iSCSI), and I've got an issue with two of the VMs: they completely freeze. The thing is, I've got clones of these VMs which work just fine, so I don't think it's the OS running inside the VM. When it freezes, I can open the console and I see the login screen of Ubuntu 20.04, but that's it: I can't press CTRL + ALT + DEL, I can't type anything in, etc.
Code:
# pveversion
pve-manager/8.2.2/9355359cd7afbae4 (running kernel: 6.8.4-3-pve)

The freezes seem to be random: sometimes the VM runs fine for days or weeks, sometimes it freezes after a few hours.

Any clue what might be the issue? It only happens to this specific VM and another.
 
Could you please post your journal using the following command:
Code:
journalctl -e
Are you running corosync or ceph or anything like that?
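If the full journal is long, a time-bounded slice around a freeze is easier to post. A sketch (the timestamps below come from the log excerpt later in this thread; adjust them to the actual incident):

```shell
# Show only the journal entries around a suspected freeze window.
# The timestamps are placeholders - adjust them to the actual incident.
show_freeze_window() {
    journalctl --since "2024-08-01 07:00" --until "2024-08-01 08:30" --no-pager
}
# Usage: show_freeze_window | grep 11005    # 11005 = ID of the affected VM
```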
 
Hi,
other than checking the journal, please post the output of pveversion -v and the configuration of the affected VMs (qm config <ID>). What does qm status <ID> --verbose say while the VM is frozen? I'd also recommend upgrading to the latest kernel to see if the issue persists.
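To grab all three outputs in one go, something like this can dump them into a single file for pasting (a sketch; collect_vm_diag is a hypothetical helper, not a PVE tool):

```shell
# Hypothetical helper: collect the requested diagnostics for one VM into a file.
collect_vm_diag() {
    vmid="$1"
    out="/tmp/vm-${vmid}-diag.txt"
    {
        echo "== pveversion -v =="
        pveversion -v
        echo "== qm config ${vmid} =="
        qm config "$vmid"
        echo "== qm status ${vmid} --verbose =="
        qm status "$vmid" --verbose
    } > "$out" 2>&1
    echo "$out"
}
# Usage on the node hosting the VM: collect_vm_diag 11005
```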
 
Hello,
@techtomic no, I'm not running corosync nor ceph.

@fiona The VM just got stuck. It seems to happen when I migrate it between hosts (I updated the cluster this morning and set the hosts, one at a time, into maintenance mode, so the VM migrated off each host and back). Since it's a production VM, I can't really replicate it on purpose, unfortunately...
I've got a ton of journalctl messages saying:
Code:
Aug 01 07:23:26 ZRH-GLT-CS03 pveproxy[58355]: got inotify poll request in wrong process - disabling inotify
which I don't see when the VM is not stuck.

Code:
# journalctl -e #nothing out of the ordinary, except this
Aug 01 08:10:59 ZRH-GLT-CS03 pveproxy[148637]: 2024-08-01 08:10:59.681457 +0200 error AnyEvent::Util: Runtime error in AnyEvent::guard callback: Can't call method "_put_session" on an undefined value at /usr/lib/x86_64-linux-gnu/perl5/5>
Aug 01 08:10:59 ZRH-GLT-CS03 pveproxy[2118]: worker 148637 finished
Aug 01 08:10:59 ZRH-GLT-CS03 pveproxy[2118]: starting 1 worker(s)
Aug 01 08:10:59 ZRH-GLT-CS03 pveproxy[2118]: worker 198944 started
Aug 01 08:11:01 ZRH-GLT-CS03 pveproxy[198944]: Clearing outdated entries from certificate cache
Aug 01 08:11:01 ZRH-GLT-CS03 pveproxy[198943]: got inotify poll request in wrong process - disabling inotify
Aug 01 08:11:02 ZRH-GLT-CS03 pveproxy[198943]: worker exit
Aug 01 08:13:45 ZRH-GLT-CS03 pveproxy[157387]: worker exit
Aug 01 08:13:45 ZRH-GLT-CS03 pveproxy[2118]: worker 157387 finished
Aug 01 08:13:45 ZRH-GLT-CS03 pveproxy[2118]: starting 1 worker(s)
Aug 01 08:13:45 ZRH-GLT-CS03 pveproxy[2118]: worker 207273 started
Aug 01 08:13:51 ZRH-GLT-CS03 pveproxy[207273]: Clearing outdated entries from certificate cache
Aug 01 08:16:57 ZRH-GLT-CS03 pvedaemon[160820]: VM 11005 qmp command failed - VM 11005 qmp command 'guest-ping' failed - got timeout
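For reference, the tell-tale qmp failures (like the guest-ping timeout above, which means the QEMU guest agent stopped responding) can be pulled out of a long journal dump with a simple filter. A sketch, with two sample lines from the excerpt above inlined so it's self-contained:

```shell
# Two sample journal lines (taken from the excerpt above) written to a temp file
cat <<'EOF' > /tmp/journal-sample.txt
Aug 01 08:10:59 ZRH-GLT-CS03 pveproxy[2118]: worker 148637 finished
Aug 01 08:16:57 ZRH-GLT-CS03 pvedaemon[160820]: VM 11005 qmp command failed - VM 11005 qmp command 'guest-ping' failed - got timeout
EOF
# On a live node you would pipe journalctl instead of the sample file:
#   journalctl | grep -E "qmp command .* failed"
grep -E "qmp command .* failed" /tmp/journal-sample.txt
```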

Code:
# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-4-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-4
proxmox-kernel-6.8: 6.8.8-4
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.8-3-pve-signed: 6.8.8-3
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
amd64-microcode: 3.20230808.1.1~deb12u1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-1
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

Code:
# qm config 11005
agent: 1
balloon: 0
boot: order=scsi0
cores: 4
memory: 32768
name: <sensitive name removed>
net0: virtio=BC:24:11:9B:3F:E9,bridge=vmbr1,tag=305
scsi0: NAS01:vm-11005-disk-0,size=120G
smbios1: uuid=be894c09-edd1-4752-95e1-dc9af122edfb
startup: order=11000
tags: Client-VM;HA;Production
vmgenid: 03c00d58-72f9-44c2-a888-0e5146932073

I'm trying to migrate the VM off a few times to see if I can "jam" it again, so I can provide the output of
Code:
qm status <id> --verbose
P.S.: after 10 migrations, the VM is still working...
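If manual migrations don't trigger it, a loop like this could automate the attempt, checking the guest agent after every migration. A sketch: the node names are hypothetical and the settle time is a guess; qm migrate, qm guest cmd ... ping, and qm status are standard PVE commands.

```shell
# Sketch: bounce VM 11005 between two nodes and stop as soon as the guest
# agent stops answering. Node names are hypothetical - adjust to the cluster.
VMID=11005
NODES="pve-node1 pve-node2"
SLEEP="${SLEEP:-30}"   # settle time after each migration, in seconds

repro_by_migration() {
    for i in 1 2 3 4 5 6 7 8 9 10; do
        for node in $NODES; do
            qm migrate "$VMID" "$node" --online || return 1
            sleep "$SLEEP"
            if ! qm guest cmd "$VMID" ping >/dev/null 2>&1; then
                echo "VM $VMID stopped responding after migration $i to $node"
                qm status "$VMID" --verbose
                return 0
            fi
        done
    done
    echo "VM $VMID survived all migrations"
}
# Usage on a PVE node: repro_by_migration
```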

Thank you!
 