[SOLVED] VMs freeze with 100% CPU

udo

Distinguished Member
Apr 22, 2009
Ahrensburg, Germany
Hi,
for some time now, VMs have been freezing with 100% CPU usage. Unfortunately, this has been happening more and more frequently lately.
Shutdown/console no longer works for these VMs - only a hard power-off and restart.
I've searched the forum, but only found threads about this involving a specific Intel CPU (NUC).

The hosts (different Dell servers) on which it occurs have different CPUs (Intel, AMD) and also different PVE versions/kernels:
7.2-11 5.19.7-2-pve
7.3-1 5.19.17-2-pve
7.4-3 6.2.6-1-pve

It also occurs on servers with a current BIOS.
It occurs more often on Windows VMs (Windows Server 2016), but also on Ubuntu machines with different OS/kernel versions (18.04, 20.04, 22.04).

Does anyone have any tips to get the known stability back?

Udo
 
I have an openmediavault VM on the latest Proxmox 7.4 which also freezes from time to time (100% CPU; I can't find any hint in the logs). This VM ran for years without hiccups.
 
We also experience this, about once a week, for several of our Windows VMs. Any pointers on how to troubleshoot this would be most welcome. I discovered today that after a reset the CPU goes back to normal levels, but the VM still does not respond to anything on the console or network. It's hard to debug when there is nothing in the logs and the VM is frozen.
 
This is definitely an issue. I've posted a couple of "me too" comments in other threads, but no apparent solution. I know other people are having this problem, so it's not an isolated one.

Almost impossible to debug, isn't it? Nothing logged, nothing sent to netconsole, nada. I noticed that if you monitor the host with top/htop, you see the "stuck" VM/QEMU process hopping between CPU cores, so the host is apparently still attempting something and it isn't simply a crashed process.

I've tried swapping between kernels (currently trying a 6.x kernel), which *appears* to be stable, but I have a feeling that if I reboot the host, it'll start acting up again. I cannot find any pattern to when the issue presents itself.
 
We're trying to find any similarities between VMs that have this issue and wonder if you have ballooning enabled on the VMs. We also wonder which machine versions are configured; we run pc-i440fx-5.1 and pc-i440fx-6.x. We also see this issue on memory-hungry VMs. The kernel versions we run are 6.2.9 and 6.1.2.
 
Ballooning may be a good hint. I had already disabled it for most of my VMs because I had issues with it in the past (years ago). But the currently affected VM did have ballooning enabled. I've just disabled it and will keep a close eye on it.
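For reference, disabling ballooning on a VM is a one-liner on the host (VM ID 100 is a made-up example); setting balloon to 0 fixes the guest memory at the configured size:

```shell
# Fix the VM's memory at the configured size (disables the ballooning device's activity)
qm set 100 --balloon 0
```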
 
Hi,
in our case it's mostly the same - memory-hungry VMs. The kernel is 6.2.9-1-pve.

I will disable ballooning for all VMs on one server (two of them froze often). Hopefully it will run more stably after that.

Udo
 
Hi,
unfortunately one VM, with ballooning disabled, froze at 100% again - after 5 days without trouble (though the VMs that froze before also only did so about once a week).

The VM config:
Code:
agent: 1
balloon: 0
bootdisk: scsi0
cores: 2
cpu: host
ide2: none,media=cdrom
memory: 16384
name: win-server
net0: virtio=F2:CA:56:36:70:BD,bridge=vmbr0,firewall=1,tag=61
numa: 0
onboot: 1
ostype: win10
scsi0: pve02pool:vm-206-disk-0,aio=threads,discard=on,iothread=1,size=75G
scsihw: virtio-scsi-single
smbios1: uuid=5e98596e-cc99-48ce-9b22-e987bcbc87ce
sockets: 2
virtio0: pve02pool:vm-206-disk-1,aio=threads,discard=on,iothread=1,size=250G
vmgenid: e8606acd-c229-4b68-887c-324dfd660df7
pveversion:
Code:
pveversion -v
proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-6.2.9-1-pve: 6.2.9-1
pve-kernel-6.2.6-1-pve: 6.2.6-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.104-1-pve: 5.15.104-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
openvswitch-switch: 2.15.0+ds1-2+deb11u4
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-2
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
pve-zsync: 2.2.3
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
Any hints are welcome!

Udo
 
Hi there!

I have this issue too. I have a Windows Server 2022 virtual machine running with memory ballooning enabled. Then I got a call from my customer that the server was very loud and nothing was working (internet and Windows shares). I logged in to my Proxmox VE remotely and could see that one of my virtual machines ("Windows Server 2022") was stuck at 100% on all 4 cores. It's sad. I had to stop this virtual machine; shutdown didn't work either.
I even had to remove the lock file under /var/run/lock to stop this VM and regain control over it. It's sad that I had to remove this lock file :(

No clue what causes this issue - after a reboot (and now with memory ballooning disabled) everything seems to be fine, but for how long? My other 2 VMs are running fine, no issues so far.

Code:
proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-6.2.11-1-pve: 6.2.11-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.0
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-2
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
Same problem here.

The hardware is a Dell PowerEdge R7525. It happened several times with a Debian Buster VM, and today for the first time with 2 Windows Server 2016 VMs simultaneously. The VMs ran at 100% CPU and didn't react to anything. I had to hit the stop button.


Code:
proxmox-ve: 7.4-1 (running kernel: 5.19.17-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.4-3
pve-kernel-5.19: 7.2-15
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.19.7-2-pve: 5.19.7-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.7.0
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-2
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
I have a similar problem with a Windows Server 2022 VM on an HP ProLiant DL360 Gen9 server. Everything is up to date and I use the latest 6.2 kernel.
Sometimes this VM starts using 100% CPU and freezes. Then my RDP users, around 50 of them, get disconnected.
I have taken some steps that have largely reduced the problem, but not resolved it:

1 - Disable KSM
2 - Set zfs_arc_max=8589934592
3 - Set swappiness=0
4 - And the config that seems to prevent the VM from using 100% CPU: when setting the CPU cores on all my VMs, I only count real cores, no hyper-threading. In my case, the server has 2 Intel E5-2699 v4 CPUs with 22 real cores each (a total of 88 threads with hyper-threading). I only use the 44 real cores, so I set my Windows Server VM to 11 cores x 2 sockets with NUMA enabled. The remaining 22 cores are divided the same way among the rest of my VMs, always with NUMA and 2 sockets. That leaves Proxmox with enough CPU headroom to handle the problem.

Since I have done that, even when the problem arises, my Windows VM uses around 50% max and does not freeze completely. If I restart the VM, the problem disappears.
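The four steps above can be sketched as host commands; this is a rough outline, not a definitive recipe (VM ID 100 is hypothetical, and the ZFS/KSM paths are the standard ones on a PVE host):

```shell
# 1 - Disable KSM (Proxmox runs KSM via the ksmtuned service)
systemctl disable --now ksmtuned
echo 2 > /sys/kernel/mm/ksm/run   # also unmerge already-shared pages

# 2 - Cap the ZFS ARC at 8 GiB (takes effect on module load / reboot)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf

# 3 - Reduce host swapping
sysctl -w vm.swappiness=0

# 4 - Give the VM only physical cores: 11 cores x 2 sockets, NUMA enabled
qm set 100 --sockets 2 --cores 11 --numa 1
```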
 
Same problem here.
Different clusters, different CPUs, different brands, CEPH storage or ZFS, Linux VMs, Windows VMs... I cannot find anything to track.
It's very, very annoying.
 
Hi,
when a VM gets stuck, you can run strace -c -p $(cat /var/run/qemu-server/<ID>.pid) with the ID of the VM. Press Ctrl+C after about 10 seconds to get the output.

You can also install the debugger and debug symbols with apt install pve-qemu-kvm-dbg gdb and then run gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/<ID>.pid).

When you share this information, please also share the output of qm config <ID> and pveversion -v to make it easier to correlate things.

If we're lucky, those will give some idea of where it's stuck.

If you don't have the latest microcode and BIOS updates installed, please try that first.
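The commands above could be bundled into a small helper script to run on the host when a VM hangs; this is only a sketch (the script name and output file names are made up, and it assumes the debug packages mentioned above are installed):

```shell
#!/bin/sh
# collect-vm-debug.sh <VMID> -- gather info for a stuck VM (hypothetical helper)
VMID="$1"
PID=$(cat "/var/run/qemu-server/${VMID}.pid")

qm config "$VMID" > "vm-${VMID}-config.txt"
pveversion -v > "vm-${VMID}-pveversion.txt"

# ~10 seconds of syscall statistics from the stuck QEMU process
timeout 10 strace -c -p "$PID" 2> "vm-${VMID}-strace.txt"

# Backtrace of all QEMU threads ('t a a bt' = thread apply all backtrace)
gdb --batch --ex 't a a bt' -p "$PID" > "vm-${VMID}-gdb.txt" 2>&1
```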
 
So I have a similar experience. I recently posted on the forum as well (see here), but I think that thread is about the exact same issue. I started seeing this pattern after upgrading from Proxmox 6 to 7, and have VMs freezing roughly once a week. I already tried some fixes on the host, including:

- upgrading to the opt-in kernel
- upgrading host BIOS
- installing intel-microcode

Next time I get another VM freezing I will definitely try to get some logging information using the instructions from fiona.

I do have a suggestion that usually works for me: when I live-migrate a "frozen" VM, it becomes operational again after the migration. Of course, this only applies if you have more than one host in your cluster.
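That workaround, sketched with qm (VM ID 206 and target node pve02 are made-up examples):

```shell
# Live-migrate the stuck VM to another cluster node; it may resume normally there
qm migrate 206 pve02 --online
```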
 
We're probably in the same boat as well; we're seeing some very high-load Debian 11 VMs (high load on CPU, RAM and I/O) freezing randomly. Sometimes they run without issues for weeks, and sometimes they crash within days of a reboot.

In our case, the common denominator is that they run on AMD processors (Ryzen 7950X, 5900X). When the same VMs run on Intel processors, they have never crashed so far. We have also tried the same as @coenvl, but to no avail.

To somehow work around the issue, we have also configured watchdogs on all VMs, but surprisingly the watchdog never fired when the VMs froze. I could trigger the watchdog to reboot a VM by manually causing a kernel panic, but apparently the state the VMs are in when they freeze is beyond the watchdog's reach.

Hopefully strace & gdb bring some light to it next time it happens.
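For reference, a virtual watchdog like the one described can be attached from the host like this (VM ID 100 is hypothetical; the guest still needs a watchdog daemon polling the device for it to do anything):

```shell
# Emulated i6300esb watchdog; resets the VM if the guest stops petting it
qm set 100 --watchdog model=i6300esb,action=reset
```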
 
It still happened this weekend, again (with two different VMs). I have now disabled KSM… perhaps it will help.

Udo
 
Hi,

You guys are not alone; we have exactly the same issue.

This is happening on our nodes with kernels 5.19, 6.0 and 6.1. PVE nodes (PVE 7.2) with 5.15 have no issue.

We tried a BIOS upgrade, CPU microcode, netconsole - nothing at all.

We are currently trying to set up a clean cluster on 7.4 with kernel 5.15, since our production cluster has mixed versions (we stopped upgrading nodes when we had the first issue).

If you guys find a solution, that would be great. If the crash happens again (every week for us), I will try what fiona asked.
 
I am having the same issue, exclusively on a TrueNAS SCALE VM. Other VMs on the server run fine.

I've had a cron job running on the VM that piped formatted `top` output to a file, but it doesn't report anything once this 100% CPU load happens, leading me to believe that this is caused by Proxmox, not TrueNAS (note the gap in the output):

2023-06-05T14:25:01-0400,5,root,0,-20,0,0,0,I,0.0,0.0,0:00.00,slub_fl+
2023-06-05T19:55:01-0400,3127,root,20,0,283448,58012,9264,S,6.7,0.3,0:01.01,cli

The only other processes that legitimately use 100% CPU occasionally are `z_wr_iss` and `smbd`, but those don't correlate with the locking issue.

It's also an AMD CPU, a Ryzen 2200G. All BIOS etc. up to date.
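A cron-driven logger like the one described might look roughly like this; the script name, log path, and field selection are guesses at how the CSV above was produced, not the poster's actual setup:

```shell
#!/bin/sh
# log-top.sh -- append one timestamped snapshot of per-process stats
# (run from cron, e.g. */5 * * * * /usr/local/bin/log-top.sh)
ts=$(date -Is)
# Skip top's summary header (first 7 lines), emit CSV: timestamp + process fields
top -b -n 1 | tail -n +8 | \
    awk -v ts="$ts" -v OFS="," '{print ts,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' \
    >> /var/log/top-trace.csv
```

If the guest truly stops scheduling anything, a gap in this log (as seen above) distinguishes a hypervisor-level hang from a runaway in-guest process.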
 