[SOLVED] VMs freeze with 100% CPU

Hi,

Yes, asking/reporting via their channels is the way to go ;)


All of this information can be found by looking at our patch file: https://git.proxmox.com/?p=pve-kern...c;hb=6810c247a180f3bb1492873cc571c3edd517d8a3

The mainline kernel accidentally fixed the issue in 6.3 with a refactoring of the code:
Code:
Upstream commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in
kvm_faultin_pfn()") unknowingly fixed the bug in v6.3 when refactoring
how KVM tracks the sequence counter snapshot.

And the stable kernel v6.1 also has the fix, that's where we picked it from:
Code:
(cherry-picked from commit 82d811ff566594de3676f35808e8a9e19c5c864c in stable v6.1.51)
@fiona Thank you very much for sharing this with us.
 
Hey guys!

I am also here with this problem. I have a 3-node Proxmox cluster. Nothing special: no Ceph, no HA, just a simple cluster.
I have 3 different VMs running Ubuntu 18.04.6 that randomly hit 100% CPU load. The VMs are under minimal load all the time. The 100% CPU load occurs randomly; I didn't run any migration or backup. Unfortunately I can't provide a trace yet, but one is coming. Do you have any idea?

root@pla3:/var/log# pveversion -v
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

VM config:
agent: 1
boot: order=scsi0;ide2;net0
cores: 24
cpu: x86-64-v2-AES
ide2: none,media=cdrom
memory: 8192
meta: creation-qemu=8.0.2,ctime=1695639420
name: pla-galera-3
net0: virtio=C6:5F:BB:36:F5:48,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-112-disk-0,cache=unsafe,format=raw,iothread=1,size=80G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=b1561c7c-f444-44af-8161-169052010036
sockets: 1
vmgenid: d597c47e-634b-486f-8a18-1c622bd7b576
 
For PVE 8, the issue described in this thread was fixed in kernel 6.2.16-12 and newer (see [1]). You are running kernel 6.2.16-3, which is still affected by the issue. I'd suggest updating your system to PVE 8.1 [2] (this should automatically pull in a newer kernel). Alternatively, you can manually install a newer kernel. After a reboot into the new kernel, the freeze issue described here should not happen anymore.

If you still see freezes even with a newer kernel, please open a new thread.

[1] https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-587633
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#system_software_updates
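To check a kernel release string against the fixed version mentioned above, a minimal Python sketch could look like the following. The function names are made up for illustration, and the comparison is only meaningful for PVE 8 kernels (6.2 and newer); it says nothing about other kernel series that received the fix separately.

```python
import re

# First PVE 8 kernel carrying the fix, per the reply above: 6.2.16-12.
FIXED_KERNEL = (6, 2, 16, 12)


def parse_pve_kernel(release):
    """Turn a release string such as '6.2.16-3-pve' into (6, 2, 16, 3)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def has_freeze_fix(release):
    """Return True if the release is 6.2.16-12 or newer (PVE 8 only)."""
    return parse_pve_kernel(release) >= FIXED_KERNEL


# Hypothetical usage with the kernel reported in the post above.
print(has_freeze_fix("6.2.16-3-pve"))   # False: still affected
print(has_freeze_fix("6.2.16-12-pve"))  # True
```

On a live host, the string to pass in would be the output of `uname -r`, which is the same value `pveversion -v` shows as "running kernel".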
 
Hi, I had this problem on kernel 6.5.11-7-pve-signed and rolled back to kernel 6.2.16-20-pve, but the problem persists.
I am also using intel_iommu to pass through the GPU, and I think this might be the source of the problem.

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on,relax_rmrr intremap=no_x2apic_optout vfio-pci.ids=0000:0b:00 video=vesafb:off video=efifb:off video=simplefb:off initcall_blacklist=sysfb_init nofb nomodeset pcie_acs_override=downstream,multifunction"

I have 2 VMs, one with CloudLinux and another with Windows.

I tested isolating the CPUs from Proxmox using GRUB_CMDLINE_LINUX="isolcpus=44-63", but I never managed to pin the CPUs correctly to each machine with the usual CPU pinning script: every time I started the VM, random CPUs were assigned to the CloudLinux machine, while it worked fine for the Windows machine.

Anyway, I can say that every time I perform a VM backup, a single core goes to 100% usage and the whole system freezes.

I am using 120 GB RAM and VirtIO SCSI single (IO threads disabled or enabled) and Async IO: threads. I tested everything and checked all the forums. The situation is terrible, I am going crazy.
 
Are you using pve-qemu-kvm=8.1.2-5? That very much sounds like the issue reported here, and you should upgrade to pve-qemu-kvm>=8.1.2-6 and then shut down and start your VMs (or migrate them to an already upgraded node), so they run with the new QEMU binary.
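As a rough way to check the installed package version, the following Python sketch reads `pveversion -v` style output (as posted earlier in this thread) and compares the pve-qemu-kvm version against 8.1.2-6. The helper names are hypothetical, and the numeric comparison is a simplification that only handles plain versions like these; `dpkg --compare-versions` is the authoritative tool for Debian version ordering.

```python
import re

# First pve-qemu-kvm version suggested in the reply above.
MIN_QEMU = "8.1.2-6"


def package_version(pveversion_output, package):
    """Find 'package: version' in pveversion -v output, or return None."""
    for line in pveversion_output.splitlines():
        name, sep, version = line.partition(": ")
        if sep and name.strip() == package:
            return version.strip()
    return None


def version_key(version):
    """Simplified ordering key: '8.1.2-5' becomes (8, 1, 2, 5)."""
    return tuple(int(number) for number in re.findall(r"\d+", version))


def qemu_needs_upgrade(pveversion_output):
    """True if pve-qemu-kvm is listed and older than MIN_QEMU."""
    installed = package_version(pveversion_output, "pve-qemu-kvm")
    if installed is None:
        raise ValueError("pve-qemu-kvm not found in the given output")
    return version_key(installed) < version_key(MIN_QEMU)


# Hypothetical usage with two lines in the format pveversion -v prints.
sample = "pve-qemu-kvm: 8.1.2-5\nqemu-server: 8.0.6\n"
print(qemu_needs_upgrade(sample))  # True: older than 8.1.2-6
```

This only looks at the installed package. A VM that was started before the upgrade keeps running the old QEMU binary until it is shut down and started again or migrated.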
 
I had an issue with my Proxmox system reaching 100% CPU utilisation.

Proxmox 8.1 with kernels 6.2 and 6.5.

This was caused by, or represented in, my FreeBSD VM reaching high 90s% of CPU usage.

The FreeBSD VM was semi-functional. It had connectivity and performed its tasks, and the console was accessible from Proxmox, but the web interface was unavailable due to the high load.

I tried upgrading the kernel from 6.2 to 6.5, but this did not help.
Running update-grub fixed the issue, and CPU load is now back to 2-5%.
 
