VMs with CPU at 100%

Matteo Calorio

Well-Known Member
Jun 30, 2017
Hi everyone,

we are experiencing a strange problem. Since updating a couple of months ago to:

[screenshot of the installed Proxmox VE version]

we occasionally find some machines stuck at 100% CPU and completely unusable.

We found that simply live-migrating the VM to another node makes it work again, without a reset.

Has anyone had similar problems?

Matteo
 
There have been reports on this forum that migration between different types/generations of CPUs sometimes has problems since recent Proxmox/kernel versions. The work-around appears to be separate pools with the same CPU type. I don't have any experience with that myself.
Also note that kernel 5.19 is not getting any updates; go back to the default 5.15 kernel or use the latest 6.2 kernel for security updates.
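A rough sketch of what that could look like on a PVE 7.x node (the version string is only an example; use whatever proxmox-boot-tool kernel list actually shows, and reboot afterwards):

Code:
# kernels currently known to the bootloader
proxmox-boot-tool kernel list
# pin the default 5.15 kernel (replace with a version from the list above)
proxmox-boot-tool kernel pin 5.15.107-2-pve
# ...or install the opt-in 6.2 kernel instead
apt update && apt install pve-kernel-6.2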
 
Thanks, yes, we installed the 5.19 kernel following a suggestion on these forums, to solve the problem of migrating VMs to nodes with different CPU frequencies:

Live migration problems between higher to lower frequencies CPUs

While this seemed to fix that problem, it also seems to have introduced another one that we didn't have in previous versions of PVE.

To go back to the 5.15 kernel, we would first need to understand whether the migration problem has been solved in its latest versions.
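Not something we have done yet, but checking which 5.15 builds the repositories offer (and their changelogs, if the repository serves them) should be possible with something like:

Code:
# kernel the node is currently running
uname -r
# all 5.15 kernel meta-package versions visible from the configured repos
apt update && apt list -a pve-kernel-5.15
# changelog of the newest available 5.15 meta package, if provided
apt changelog pve-kernel-5.15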

Our nodes are servers with CPUs like:
40 x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (2 Sockets)
48 x Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (2 Sockets)
40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets)
48 x AMD EPYC 7272 12-Core Processor (2 Sockets)
64 x Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz (2 Sockets)

I read the post you pointed to, but so far I have not found any clues that could resolve our situation.
 
Hi, we still have the problem. Here are some messages from a VM that got stuck:

Code:
kernel:[2261815.282065] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [JournalFlusher:870]

Message from syslogd@graylog5 at Sep 11 14:13:07 ...
kernel:[2261815.283198] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ftdc:872]

Message from syslogd@graylog5 at Sep 11 14:13:07 ...
kernel:[2261815.283226] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [pool-3-thread-1:985]
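For reference, this is roughly how such a stuck VM can be inspected from the host side (106 is just a placeholder VMID; the commands are the standard qm tooling):

Code:
# Proxmox' view of the guest
qm status 106 --verbose
# CPU usage of the QEMU/KVM process backing the VM
top -b -n 1 -p "$(cat /var/run/qemu-server/106.pid)"
# QEMU monitor; 'info status' and 'info cpus' show whether the vCPUs still run
qm monitor 106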

Thanks,
Matteo
 
Hi,
What CPU model have you set in the VM configuration? Please also post the output of pveversion -v from the source and target of the migration.

Between which source and target CPUs does the issue happen?
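For example, something along these lines on the node currently hosting the VM (100 is just a placeholder VMID):

Code:
# CPU model configured for the guest
qm config 100 | grep ^cpu
# package versions on this node
pveversion -v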
 
In that case it was "host", but normally we use "Default (kvm64)", and the behaviour is the same.

Code:
source:~$ pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.19.17-2-pve)
pve-manager: 7.4-15 (running version: 7.4-15/a5d2a31e)
pve-kernel-5.15: 7.4-3
pve-kernel-5.19: 7.2-15
pve-kernel-5.4: 6.4-20
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Code:
destination:~$ pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.19.17-2-pve)
pve-manager: 7.4-15 (running version: 7.4-15/a5d2a31e)
pve-kernel-5.15: 7.4-3
pve-kernel-5.19: 7.2-15
pve-kernel-5.4: 6.4-20
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

It seems to happen with any source and destination.

Thanks,
Matteo
 
In that case "host", but normally we use "Default (kvm64)", and the behaviour is the same.
If you have different physical CPUs, you cannot use host, because that will mean something different on each host; also see the CPU Type section here: https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_cpu
In particular, with mixed AMD/Intel you should use kvm64/qemu64 or an abstract x86-64-vX model that each host supports.

EDIT: I should clarify that the above applies in case you want to live-migrate between the hosts.
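As a sketch (not from the original reply): the CPU type can be changed per VM with qm set; 100 is a placeholder VMID, and the x86-64-vX models are only available with recent enough qemu-server/QEMU versions, otherwise kvm64 remains the safe choice. The change only takes effect once the VM has been fully stopped and started again:

Code:
# lowest common denominator that works across mixed Intel/AMD hosts
qm set 100 --cpu kvm64
# or, if all hosts support it and the installed version offers it, an abstract model
qm set 100 --cpu x86-64-v2-AES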

Code:
source:~$ pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.19.17-2-pve)
The opt-in kernel 5.19 has been superseded by kernel 6.2 (and previously by 6.1).

I'd try upgrading to a current kernel and fixing the CPU type config to see if the issue persists.
 
@Matteo Calorio There has been a long-standing bug with kernels 5.19, 6.1 and 6.2 that has been solved recently and that caused exactly the behavior you are describing. Take a look at this (long!) thread:

https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-586503

The patched 6.2 kernel is available in the testing and no-subscription repos and is scheduled to land in the enterprise repo this week if no problems are detected with it. If you are using the enterprise repo, I suggest you just wait a few days and then update to the patched 6.2 kernel (proxmox-kernel-6.2.16-12-pve).
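Once the package shows up in the configured repository, installing and booting it is roughly (a sketch; the package name is the one given above):

Code:
apt update
apt install proxmox-kernel-6.2.16-12-pve
# reboot the node into the new kernel (migrate or shut down its guests first)
reboot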
 
AFAIK, with that bug you wouldn't get any output in the guest, however, not even messages about soft lockups. FYI, the kernel should be available in the enterprise repositories too now.
 
