[SOLVED] VMs freeze with 100% CPU

Same problem for us on 24 of 26 Debian VMs since we upgraded to v8 and, accordingly, the new kernel. Every time we move the VMs, e.g. when restarting the host, the VM freezes with 40-100% CPU load and won't respond to network requests or the terminal.

We use CEPH but don't have file descriptor limit issues, and tuning the limits further did not help either.

We don't use KSM.

We don't want to disable mitigations; they are there for a reason.
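
For anyone who wants to double-check the same three points on their own hosts, a minimal sketch of how they can be verified (standard Linux/PVE interfaces; <qemu-pid> is a placeholder for the VM's QEMU process ID):
Code:
# File descriptor limit and current usage of the QEMU process
grep 'open files' /proc/<qemu-pid>/limits
ls /proc/<qemu-pid>/fd | wc -l

# KSM state (run = 0 means KSM is stopped)
cat /sys/kernel/mm/ksm/run
systemctl status ksmtuned --no-pager

# CPU vulnerability mitigations currently in effect
grep . /sys/devices/system/cpu/vulnerabilities/*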

Code:
strace -c -p 35163
strace: Process 35163 attached
^Cstrace: Process 35163 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96   36.947836       69061       535           ppoll
  0.02    0.009002          11       814       192 futex
  0.01    0.002379           1      2008           write
  0.00    0.001371           2       517           read
  0.00    0.000726           1       491           recvmsg
  0.00    0.000030           3        10           sendmsg
  0.00    0.000011           5         2           close
  0.00    0.000010           5         2           accept4
  0.00    0.000006           1         4           fcntl
  0.00    0.000003           1         2           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00   36.961374        8429      4385       192 total
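
For reference, the PID traced above is the QEMU process of the affected VM; on a PVE host it can typically be looked up like this (102 is an example VM ID):
Code:
# PID file written by PVE for each running VM
cat /var/run/qemu-server/102.pid
# or check the PID column of
qm list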
 
Hi,
Same problem for us on 24 of 26 Debian VMs since we upgraded to v8 and, accordingly, the new kernel. Every time we move the VMs, e.g. when restarting the host, the VM freezes with 40-100% CPU load and won't respond to network requests or the terminal.
that does sound different from the other reports. If it always happens upon migration, it might be a different issue.
Code:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96   36.947836       69061       535           ppoll
  0.02    0.009002          11       814       192 futex
The futex errors were also not present for anybody else, I think.

Can you share the output of pveversion -v from the source and target node of a problematic migration (maybe create a clone or dummy VM for testing) and the configuration of an affected VM? What kind of CPU models do your hosts in the cluster have?
 
Hello @fiona, I recently found another problem, which causes my VM to crash instead of freezing (https://forum.proxmox.com/threads/opnsense-keeps-crashing.131601/).
Also, ppoll has a higher number there.
The VM OS is a BSD (OPNsense), and since the latest OPNsense version, which was released 2 days ago, the whole VM is not usable.

Would it be possible for Proxmox to re-enable kernel 5.15, as most users in this thread had no issues with that kernel version?
 
Hi,

that does sound different from the other reports. If it always happens upon migration, it might be a different issue.
Sorry, by move I meant migration, e.g. when stopping the host our HA will migrate all VMs before restarting, as well as when we manually migrate VMs.
The futex errors were also not present for anybody else, I think.

Can you share the output of pveversion -v from the source and target node of a problematic migration (maybe create a clone or dummy VM for testing) and the configuration of an affected VM? What kind of CPU models do your hosts in the cluster have?
All our systems have the following output, with the exception of the newer systems which don't use zfs and proxmox-offline-mirror-helper:

Code:
37d36
< proxmox-offline-mirror-helper: 0.6.2
54d52
< zfsutils-linux: 2.1.12-pve1

Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-5-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.4
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.16-5-pve: 6.2.16-6
pve-kernel-6.2.16-4-pve: 6.2.16-5
pve-kernel-5.15.107-2-pve: 5.15.107-2
ceph: 17.2.6-pve1+3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx3
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.6
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.4
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.1-1
proxmox-backup-file-restore: 3.0.1-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.2
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
 
Sorry, by move I meant migration, e.g. when stopping the host our HA will migrate all VMs before restarting, as well as when we manually migrate VMs.
Is the Ceph network separate from the migration network? What does CPU/IO/network load look like during a migration? Please share one or two VM configurations.

What CPU models do you have in the cluster?
 
Is the Ceph network separate from the migration network?
The ceph cluster_network is on a dedicated 10Gbit/s interface on the 3 CEPH hosts, on a dedicated network.
Each of the 3 CEPH hosts runs a manager, a monitor, and 4 metadata servers (not currently used, just there as preparation).
The ceph public_network is on a dedicated 10Gbit/s interface on the 3 CEPH hosts, on a network shared with the 2 other PVE hosts, each likewise with 10Gbit/s interfaces.
All VMs use CEPH for their system drive; the actual data processed within the VMs goes through an NFS-based system unrelated to CEPH.
The 2 additional PVE hosts have CEPH installed but nothing actively running on it.
All systems additionally have 1Gbit/s internet-facing interfaces.
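
For context, the dedicated migration address shown in the task logs below typically comes from the cluster-wide migration setting; a minimal sketch of how that looks in /etc/pve/datacenter.cfg (the subnet is a placeholder):
Code:
# /etc/pve/datacenter.cfg (sketch)
# Send live-migration traffic over a dedicated network,
# separate from the Ceph public/cluster networks.
migration: secure,network=10.x.x.0/24
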
What does CPU/IO/network load look like during a migration?
The migration itself is quick, but CPU load generally keeps climbing afterwards, as almost all of the VMs end up in the frozen state.

Code:
task started by HA resource agent
2023-08-03 22:31:13 use dedicated network address for sending migration traffic (10.x.x.15)
2023-08-03 22:31:14 starting migration of VM 122 to node 'x' (10.x.x.15)
2023-08-03 22:31:14 starting VM 122 on remote node 'x'
2023-08-03 22:31:17 start remote tunnel
2023-08-03 22:31:18 ssh tunnel ver 1
2023-08-03 22:31:18 starting online/live migration on tcp:10.x.x.15:60000
2023-08-03 22:31:18 set migration capabilities
2023-08-03 22:31:18 migration downtime limit: 100 ms
2023-08-03 22:31:18 migration cachesize: 2.0 GiB
2023-08-03 22:31:18 set migration parameters
2023-08-03 22:31:18 start migrate command to tcp:10.x.x.15:60000
2023-08-03 22:31:19 migration active, transferred 296.8 MiB of 16.0 GiB VM-state, 343.8 MiB/s
2023-08-03 22:31:20 migration active, transferred 699.8 MiB of 16.0 GiB VM-state, 562.6 MiB/s
2023-08-03 22:31:21 migration active, transferred 1.2 GiB of 16.0 GiB VM-state, 947.7 MiB/s
2023-08-03 22:31:22 migration active, transferred 1.7 GiB of 16.0 GiB VM-state, 2.7 GiB/s
2023-08-03 22:31:23 migration active, transferred 2.2 GiB of 16.0 GiB VM-state, 629.5 MiB/s
2023-08-03 22:31:24 migration active, transferred 2.7 GiB of 16.0 GiB VM-state, 4.1 GiB/s
2023-08-03 22:31:26 migration active, transferred 3.5 GiB of 16.0 GiB VM-state, 932.1 MiB/s
2023-08-03 22:31:26 average migration speed: 2.0 GiB/s - downtime 254 ms
2023-08-03 22:31:26 migration status: completed
2023-08-03 22:31:29 migration finished successfully (duration 00:00:16)
TASK OK


Code:
task started by HA resource agent
2023-08-03 22:31:13 use dedicated network address for sending migration traffic (10.x.x.15)
2023-08-03 22:31:14 starting migration of VM 103 to node 'x' (10.x.x.15)
2023-08-03 22:31:14 starting VM 103 on remote node 'x'
2023-08-03 22:31:17 start remote tunnel
2023-08-03 22:31:18 ssh tunnel ver 1
2023-08-03 22:31:18 starting online/live migration on tcp:10.x.x.15:60001
2023-08-03 22:31:18 set migration capabilities
2023-08-03 22:31:18 migration downtime limit: 100 ms
2023-08-03 22:31:18 migration cachesize: 256.0 MiB
2023-08-03 22:31:18 set migration parameters
2023-08-03 22:31:18 start migrate command to tcp:10.x.x.15:60001
2023-08-03 22:31:19 migration active, transferred 552.5 MiB of 2.0 GiB VM-state, 1.1 GiB/s
2023-08-03 22:31:20 average migration speed: 1.0 GiB/s - downtime 78 ms
2023-08-03 22:31:20 migration status: completed
2023-08-03 22:31:24 migration finished successfully (duration 00:00:11)
TASK OK
Please share one or two VM configurations.
Code:
cat  /etc/pve/qemu-server/102.conf
agent: 1
boot: c
bootdisk: scsi0
cipassword: x
ciuser: sofo
cores: 4
cpu: host
ide2: kvmdatapool:vm-102-cloudinit,media=cdrom,size=4M
memory: 16384
name: k8s-x-n1
nameserver: 10.x.x.1
net0: virtio=86:x:x:x:x:x,bridge=vmbr0
net1: virtio=9A:x:x:x:x:x,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: kvmdatapool:vm-102-disk-0,aio=native,iothread=1,size=20G
scsihw: virtio-scsi-single
searchdomain: int.x.x.x.x
serial0: socket
smbios1: uuid=x
sockets: 2
sshkeys: x
vmgenid: x

What CPU models do you have in the cluster?
3x PVE/CEPH Systems
Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz (1 Socket)
256GB RAM
2x NVME system raid 1
6x Datacenter SSD 2.75TiB OSDs

2x PVE only Systems
Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (1 Socket)
64GB RAM
2x NVME system raid 1
 
Code:
cat  /etc/pve/qemu-server/102.conf
cpu: host
3x PVE/CEPH Systems
Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz (1 Socket)
256GB RAM
2x NVME system raid 1
6x Datacenter SSD 2.75TiB OSDs
2x PVE only Systems
Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (1 Socket)
64GB RAM
2x NVME system raid 1
You cannot use the host CPU type when you have different physical CPU models; see the CPU Type section in the documentation: https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_cpu

Does the issue also occur when you migrate between nodes with the exact same CPU model?
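
If moving away from the host type is an option, a hedged sketch of switching an affected VM to a baseline model that both CPU generations should support (VM 102 and x86-64-v2-AES are examples; verify the model against your hardware):
Code:
# Change the virtual CPU from "host" to a shared baseline model
qm set 102 --cpu x86-64-v2-AES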
 
You cannot use the host CPU type when you have different physical CPU models; see the CPU Type section in the documentation: https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_cpu

Does the issue also occur when you migrate between nodes with the exact same CPU model?
This is a very good point! Our HA groups were misconfigured o_O; they should stick to the two different types of CPUs. Thanks a lot! I am fixing that and will verify accordingly.
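
For anyone fixing the same thing, a rough sketch of restricted HA groups per CPU generation (group and node names are made up; adjust to your cluster):
Code:
# One restricted HA group per CPU model, so HA only migrates VMs
# between hosts with identical CPUs (names are placeholders)
ha-manager groupadd xeon-w  --nodes "ceph1,ceph2,ceph3" --restricted 1
ha-manager groupadd core-i9 --nodes "pve4,pve5" --restricted 1
# Pin an existing HA resource to the matching group
# (use `ha-manager add` instead if the VM is not an HA resource yet)
ha-manager set vm:102 --group xeon-w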
 
Hello,

For information, we upgraded our cluster from Proxmox 7.4/kernel 6.2 to Proxmox 8.0/kernel 6.2 this weekend, and disabled both KSM and mitigations in the process. Let's see if it's more stable like that.
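
In case it helps others, a rough sketch of how KSM and the mitigations are commonly disabled on a PVE host (for reference only; the exact steps depend on the setup, and disabling mitigations has security implications):
Code:
# Disable KSM
systemctl disable --now ksmtuned
echo 2 > /sys/kernel/mm/ksm/run    # 2 = stop KSM and unmerge shared pages

# Disable CPU mitigations via the kernel command line (GRUB-booted hosts):
# add mitigations=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
update-grub    # then reboot
# Hosts booting via proxmox-boot-tool/systemd-boot edit /etc/kernel/cmdline
# and run: proxmox-boot-tool refresh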

KSM I can easily live without; we are careful to never overcommit memory. About the mitigations I'm not really happy, but we don't run an open cloud, so the risks are moderate.

I'll keep you informed if we get another crash with this setup; otherwise, in about 10 days I'll declare victory ;)

Regards,
 
Hello,

So it has been over 10 days since we upgraded to Proxmox 8.0 and disabled KSM and mitigations, and we haven't had a single crash so far. I'm starting to feel confident that we actually "solved" the problem (even if disabling mitigations is not something I'm very happy about).

Regards,
 
We have hundreds of VMs, and from the kernel 6.2 upgrade until the changes of August 6th we had at least one of them (not always the same one, of course) freezing every week, usually more like 2-3 freezes per week in total. It might just be luck that we didn't get any in 10 days, but that sounds unlikely.
 
Haven't had this issue occur since upgrading to kernel 6.2.16-6-pve two weeks ago, whereas it was happening up to multiple times a day before.
Might just be luck, or a different package that got upgraded, though.

UPDATE: Still going strong almost two weeks later, I'm hopeful
 
Hi,
With or without KSM?
I guess such an issue doesn't just disappear without us hearing something from the developers.

Udo
With KSM and ballooning. Also, given the number of different causes found in this thread, I'm not too sure the devs know all the reasons for this happening :v
 
Unfortunately, it still happens. And that's what keeps me from ditching Hyper-V for Proxmox VE. This error means that I do not yet consider the Proxmox VE solution to be reliable. Hyper-V is rock solid, and I've never seen any VM get stuck at 100% CPU usage. I am actively watching this thread and waiting for a breakthrough. Then Proxmox VE will be the target solution.
 
Unfortunately, it still happens. And that's what keeps me from ditching Hyper-V for Proxmox VE. This error means that I do not yet consider the Proxmox VE solution to be reliable. Hyper-V is rock solid, and I've never seen any VM get stuck at 100% CPU usage. I am actively watching this thread and waiting for a breakthrough. Then Proxmox VE will be the target solution.
We are also encountering these problems. As a test, we have disabled ballooning and are currently still running our VMs with KSM. Nevertheless, one VM went back to 100% CPU utilization again today.
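
For reference, ballooning can be turned off per VM with a single config change; a minimal sketch (102 is an example VM ID):
Code:
# Disable the balloon device so the VM always keeps its full memory assignment
qm set 102 --balloon 0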
 
Since the 2nd VM already crashed today, we have now also deactivated KSM. So ballooning & KSM are both deactivated; I am curious about the next few days, as it ran for almost 9 days without problems until today...
 
Unfortunately, it still happens. And that's what keeps me from ditching Hyper-V for Proxmox VE. This error means that I do not yet consider the Proxmox VE solution to be reliable. Hyper-V is rock solid, and I've never seen any VM get stuck at 100% CPU usage. I am actively watching this thread and waiting for a breakthrough. Then Proxmox VE will be the target solution.
You realize the PVE devs do not work on QEMU and the Linux kernel, right?
 
