[SOLVED] VMs freeze with 100% CPU

Hi,

Yes, asking/reporting via their channels is the way to go ;)


All of this information can be found by looking at our patch file: https://git.proxmox.com/?p=pve-kern...c;hb=6810c247a180f3bb1492873cc571c3edd517d8a3

The mainline kernel accidentally fixed the issue in 6.3 with a refactoring of the code:
Code:
Upstream commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in
kvm_faultin_pfn()") unknowingly fixed the bug in v6.3 when refactoring
how KVM tracks the sequence counter snapshot.

And the stable kernel v6.1 also has the fix, that's where we picked it from:
Code:
(cherry-picked from commit 82d811ff566594de3676f35808e8a9e19c5c864c in stable v6.1.51)
@fiona Thank you very much for sharing this with us.
 
Hey guys!

I am also affected by this problem. I have a 3-node Proxmox cluster. Nothing special: no Ceph, no HA, just a simple cluster.
I have 3 different VMs running Ubuntu 18.04.6 that randomly go to 100% CPU load. The VMs are under minimal load all the time. The 100% CPU load occurs randomly; I did not run any migration or backup. Unfortunately I can't provide a trace yet, but one is coming. Do you have any idea?

root@pla3:/var/log# pveversion -v
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

VM config:
agent: 1
boot: order=scsi0;ide2;net0
cores: 24
cpu: x86-64-v2-AES
ide2: none,media=cdrom
memory: 8192
meta: creation-qemu=8.0.2,ctime=1695639420
name: pla-galera-3
net0: virtio=C6:5F:BB:36:F5:48,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-112-disk-0,cache=unsafe,format=raw,iothread=1,size=80G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=b1561c7c-f444-44af-8161-169052010036
sockets: 1
vmgenid: d597c47e-634b-486f-8a18-1c622bd7b576
 
For PVE 8, the issue described in this thread was fixed in kernel 6.2.16-12 and newer (see [1]). You are running kernel 6.2.16-3, which is still affected by the issue. I'd suggest updating your system to PVE 8.1 [2] (this should automatically pull in a newer kernel). Alternatively, you can manually install a newer kernel. After a reboot into the new kernel, the freeze issue described here should not happen anymore.

If you still see freezes even with a newer kernel, please open a new thread.

[1] https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-587633
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#system_software_updates
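As a rough illustration of the version check described above, here is a minimal Python sketch. It assumes the kernel release string has the form printed by `uname -r` on PVE (for example `6.2.16-3-pve`) and compares it against 6.2.16-12 as a plain tuple of numbers. The function names are made up for this example, the comparison is a simplification of real Debian version ordering, and it is only meaningful for PVE 8 kernels.

```python
import re

# First PVE 8 kernel build carrying the fix, as stated in the post above.
FIXED_KERNEL = (6, 2, 16, 12)


def parse_kernel(release: str) -> tuple:
    """Turn a release string such as '6.2.16-3-pve' into (6, 2, 16, 3)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_freeze_fix(release: str) -> bool:
    """True if the release is at or above the first fixed build."""
    return parse_kernel(release) >= FIXED_KERNEL


# Kernel releases mentioned in this thread.
for release in ("6.2.16-3-pve", "6.2.16-20-pve", "6.5.11-7-pve"):
    status = "has the fix" if has_freeze_fix(release) else "still affected"
    print(f"{release}: {status}")
```

A kernel that passes this check can of course still show other, unrelated freezes.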
 
Hi, I was on kernel 6.5.11-7-pve-signed when I hit this problem. I rolled back to kernel 6.2.16-20-pve, but the problem persists.
I am also using intel_iommu to pass through the GPU, and I think this might be the source of the problem.

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on,relax_rmrr intremap=no_x2apic_optout vfio-pci.ids=0000:0b:00 video=vesafb:off video=efifb:off video=simplefb:off initcall_blacklist=sysfb_init nofb nomodeset pcie_acs_override=downstream,multifunction"

I have 2 VMs, one with CloudLinux and another with Windows.

I tested isolating the CPUs from Proxmox using GRUB_CMDLINE_LINUX="isolcpus=44-63", but I never managed to pin the CPUs correctly for each machine with the classic CPU pinning script: every time I turned on the VM, random CPUs were assigned to the CloudLinux machine, while it worked fine for the Windows machine.

Anyway, I can say that every time I perform a VM backup, a single core goes to 100% usage and the whole system freezes.

I am using 120 GB RAM, VirtIO SCSI single (tested with IO threads both disabled and enabled) and Async IO: threads. I tested everything and checked all the forums. The situation is terrible, I am going crazy.
 
Are you using pve-qemu-kvm=8.1.2-5? That very much sounds like the issue reported here, and you should upgrade to pve-qemu-kvm>=8.1.2-6 and then shut down and start your VMs (or migrate them to an already upgraded node), so they will run with the new QEMU binary.
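If it helps, the installed package version can be read out of saved `pveversion -v` output with a few lines of Python. This is only a sketch: the helper names are invented for this example, and it assumes the `name: version` line layout seen in the outputs posted in this thread. Note that it reports the installed package only; a VM started before the upgrade keeps running the old QEMU binary until it is shut down and started again or migrated.

```python
def parse_pveversion(output: str) -> dict:
    """Map package name to version string from `pveversion -v` output."""
    packages = {}
    for line in output.splitlines():
        name, sep, rest = line.partition(": ")
        if sep:
            # Drop trailing annotations such as "(running kernel: ...)".
            packages[name.strip()] = rest.split(" (")[0].strip()
    return packages


def has_reported_qemu_build(packages: dict) -> bool:
    """True if the installed pve-qemu-kvm is the build named in the post above."""
    return packages.get("pve-qemu-kvm") == "8.1.2-5"


# Shortened sample in the same layout as the outputs in this thread.
sample = """proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-qemu-kvm: 8.0.2-3
qemu-server: 8.0.6"""

packages = parse_pveversion(sample)
print("pve-qemu-kvm:", packages["pve-qemu-kvm"])
print("reported build installed:", has_reported_qemu_build(packages))
```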
 
I had an issue with my Proxmox system reaching 100% CPU utilisation.

Proxmox 8.1 and kernel 6.2 and 6.5

This was caused by, or showed up as, my FreeBSD VM reaching CPU usage in the high 90s (percent).

The FreeBSD VM was semi-functional. It had connectivity and performed its tasks, and the console was accessible from Proxmox, but the web interface was unavailable due to the high load.

I tried upgrading the kernel from 6.2 to 6.5, but this did not help.
Running update-grub fixed the issue, and CPU load is now back to 2-5%.
 
I observed that after a virtual machine reaches 100% CPU and is rebooted, the operating system's time no longer matches the current time. The issue is probably related to this.
 
Hi,
the issue from this thread was already resolved in kernels >= 6.2.16-12. So your issue is most likely not the same. Please open a new thread, describing your issue in detail, check your system logs/journal for any further information and share the output of pveversion -v and qm config <ID> replacing <ID> with the ID of your VM.
 
I have 2 PVE machines, one running version 8.1 and the other running version 8.2. The kernel versions are Linux 6.5.11-7-pve and Linux 6.8.4-2-pve respectively. On both machines, a virtual machine is stuck at 100% CPU usage. The details below are from one of them.
The virtual machine's operating system is Windows 10 LTSC 2021. The memory consumption shown in the image is not accurate; the actual memory consumption should be around 6-10 GB.
Attachments: 1.PNG, 2.PNG

Code:
root@pve:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

root@pve:~# qm config 100
agent: 0
balloon: 0
bios: ovmf
boot: order=ide1;sata0;net0
cores: 16
cpu: host
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:02:00.0,mdev=nvidia-180,pcie=1
ide1: local-lvm:vm-100-disk-1,size=104858K
localtime: 0
machine: pc-q35-8.1
memory: 32768
meta: creation-qemu=8.1.5,ctime=1714365625
name: bluexiner
net0: e1000=BC:24:11:D9:80:48,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: win11
parent: ok
sata0: local-lvm:vm-100-disk-2,size=120G
scsihw: virtio-scsi-single
smbios1: uuid=1026a512-6dc6-493e-81a1-cba1397ffdcc
sockets: 1
vga: none
vmgenid: d9f79b6a-9cd3-488b-b2fd-c7f9aae55707



root@pve:~# strace -c -p $(cat /var/run/qemu-server/100.pid)
strace: Process 205851 attached
^Cstrace: Process 205851 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 92.76    6.582166         869      7573           ppoll
  3.50    0.248367           8     28752           write
  1.39    0.098354          12      7839       265 futex
  1.19    0.084244          11      7034           recvmsg
  1.04    0.073470           9      7398           read
  0.07    0.004718          33       140           sendmsg
  0.03    0.001882          67        28           close
  0.01    0.001040          37        28           accept4
  0.01    0.000982          35        28           getsockname
  0.01    0.000841          15        56           fcntl
------ ----------- ----------- --------- --------- ----------------
100.00    7.096064         120     58876       265 total
 

There's nothing really special in the trace. How much load would you expect when looking inside the guest? Or is the usage also 100% without an actual workload running? Maybe that version of Windows does something similar to Linux with idle=poll on the kernel command line?

Is the VM doing heavy IO? You could also try and see if the PCI passthrough influences anything.
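For anyone who wants to compare such traces, the `strace -c` summary can be reduced to per-syscall time shares with a short Python sketch like the one below. The function names are made up for this example, and the parsing assumes the column layout shown in the output posted above (an optional errors column, with the syscall name last).

```python
def parse_strace_summary(text: str) -> dict:
    """Map syscall name to seconds spent, from an `strace -c` summary table."""
    seconds = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            seconds[fields[-1]] = float(fields[1])
        except ValueError:
            continue  # separator row or shell prompt line
    return seconds


def time_share(seconds: dict, syscall: str) -> float:
    """Percentage of the traced syscall time spent in one syscall."""
    total = sum(seconds.values())
    return 100.0 * seconds[syscall] / total if total else 0.0


# Rows copied from the trace posted in this thread.
SAMPLE = """% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 92.76    6.582166         869      7573           ppoll
  3.50    0.248367           8     28752           write
  1.39    0.098354          12      7839       265 futex
  1.19    0.084244          11      7034           recvmsg
  1.04    0.073470           9      7398           read
  0.07    0.004718          33       140           sendmsg
  0.03    0.001882          67        28           close
  0.01    0.001040          37        28           accept4
  0.01    0.000982          35        28           getsockname
  0.01    0.000841          15        56           fcntl
------ ----------- ----------- --------- --------- ----------------
100.00    7.096064         120     58876       265 total"""

summary = parse_strace_summary(SAMPLE)
for name in sorted(summary, key=summary.get, reverse=True)[:3]:
    print(f"{name}: {time_share(summary, name):.2f}% of syscall time")
```

Keep in mind that these numbers only describe time spent in system calls of the traced process, so they do not show what the guest itself is doing.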
 
