VMs with CPU at 100%

Matteo Calorio

Well-Known Member
Jun 30, 2017
Hi everyone,

we are experiencing a strange problem. Since updating a couple of months ago to:

[screenshot of the installed Proxmox VE version]

we occasionally find some machines stuck at 100% CPU and completely unusable.

We found that simply live-migrating the VM to another node makes it work again, without a reset.

Has anyone had similar problems?

Matteo
 
There have been reports on this forum that migration between different types/generations of CPUs sometimes has problems since recent Proxmox/kernel versions. The work-around appears to be separate pools with the same CPU type. I don't have any experience with that myself.
Also note that kernel 5.19 is not getting any updates; go back to the default 5.15 kernel or use the latest 6.2 kernel for security updates.
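A rough sketch of what that could look like on a PVE 7.x node (the version string is only an example; use whatever proxmox-boot-tool kernel list actually shows, and reboot afterwards):

Code:
# kernels currently known to the bootloader
proxmox-boot-tool kernel list
# pin the default 5.15 kernel (replace with a version from the list above)
proxmox-boot-tool kernel pin 5.15.107-2-pve
# ...or install the opt-in 6.2 kernel instead
apt update && apt install pve-kernel-6.2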
 
Thanks, yes, we installed the 5.19 kernel following a suggestion on these forums, to solve the problem of migrating VMs to nodes with different CPU frequencies:

Live migration problems between higher to lower frequencies CPUs

While this seemed to fix that problem, it also seems to have introduced another one that we didn't have in previous versions of PVE.

To go back to the 5.15 kernel, we would first need to understand whether the migration problem has been solved in its latest versions.
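Not something we have done yet, but checking which 5.15 builds the repositories offer (and their changelogs, if the repository serves them) should be possible with something like:

Code:
# kernel the node is currently running
uname -r
# all 5.15 kernel meta-package versions visible from the configured repos
apt update && apt list -a pve-kernel-5.15
# changelog of the newest available 5.15 meta package, if provided
apt changelog pve-kernel-5.15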

Our nodes are servers with CPUs like:
40 x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (2 Sockets)
48 x Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (2 Sockets)
40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets)
48 x AMD EPYC 7272 12-Core Processor (2 Sockets)
64 x Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz (2 Sockets)

I read the post you pointed to, but so far I have not found any clues that could resolve our situation.
 
Hi, we still have the problem. Here are some messages from a VM that got stuck:

Code:
kernel:[2261815.282065] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [JournalFlusher:870]

Message from syslogd@graylog5 at Sep 11 14:13:07 ...
kernel:[2261815.283198] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ftdc:872]

Message from syslogd@graylog5 at Sep 11 14:13:07 ...
kernel:[2261815.283226] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [pool-3-thread-1:985]
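For reference, this is roughly how such a stuck VM can be inspected from the host side (106 is just a placeholder VMID; the commands are the standard qm tooling):

Code:
# Proxmox' view of the guest
qm status 106 --verbose
# CPU usage of the QEMU/KVM process backing the VM
top -b -n 1 -p "$(cat /var/run/qemu-server/106.pid)"
# QEMU monitor; 'info status' and 'info cpus' show whether the vCPUs still run
qm monitor 106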

Thanks,
Matteo
 
Hi,
What CPU model have you set in the VM configuration? Please also post the output of pveversion -v from the source and target of the migration.

Between which source and target CPUs does the issue happen?
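For example, something along these lines on the node currently hosting the VM (100 is just a placeholder VMID):

Code:
# CPU model configured for the guest
qm config 100 | grep ^cpu
# package versions on this node
pveversion -v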
 
In that case it was "host", but normally we use "Default (kvm64)", and the behaviour is the same.

Code:
source:~$ pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.19.17-2-pve)
pve-manager: 7.4-15 (running version: 7.4-15/a5d2a31e)
pve-kernel-5.15: 7.4-3
pve-kernel-5.19: 7.2-15
pve-kernel-5.4: 6.4-20
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Code:
destination:~$ pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.19.17-2-pve)
pve-manager: 7.4-15 (running version: 7.4-15/a5d2a31e)
pve-kernel-5.15: 7.4-3
pve-kernel-5.19: 7.2-15
pve-kernel-5.4: 6.4-20
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

It seems to happen with any source and destination.

Thanks,
Matteo
 
In that case "host", but normally we use "Default (kvm64)", and the behaviour is the same.
If you have different physical CPUs, you cannot use host, because that will mean something different on each host; also see the CPU Type section here: https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_cpu
In particular, with mixed AMD/Intel you should use kvm64/qemu64 or an abstract x86-64-vX model that each host supports.

EDIT: I should clarify that the above applies in case you want to live-migrate between the hosts.
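As a sketch (not from the original reply): the CPU type can be changed per VM with qm set; 100 is a placeholder VMID, and the x86-64-vX models are only available with recent enough qemu-server/QEMU versions, otherwise kvm64 remains the safe choice. The change only takes effect once the VM has been fully stopped and started again:

Code:
# lowest common denominator that works across mixed Intel/AMD hosts
qm set 100 --cpu kvm64
# or, if all hosts support it and the installed version offers it, an abstract model
qm set 100 --cpu x86-64-v2-AES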

Code:
source:~$ pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.19.17-2-pve)
The opt-in kernel 5.19 has been superseded by kernel 6.2 (and previously by 6.1).

I'd try upgrading to a current kernel and fixing the CPU type config to see if the issue persists.
 
@Matteo Calorio There has been a long-standing bug with kernels 5.19, 6.1 and 6.2 that has been solved recently and that caused exactly the behavior you are describing. Take a look at this (long!) thread:

https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-586503

The patched 6.2 kernel is available in the testing and no-subscription repos and is scheduled to land in the enterprise repo this week if no problems are detected with it. If you are using the enterprise repo, I suggest you just wait a few days and then update to the patched 6.2 kernel (proxmox-kernel-6.2.16-12-pve).
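Once the package shows up in the configured repository, installing and booting it is roughly (a sketch; the package name is the one given above):

Code:
apt update
apt install proxmox-kernel-6.2.16-12-pve
# reboot the node into the new kernel (migrate or shut down its guests first)
reboot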
 
AFAIK, with that bug you wouldn't get any output in the guest, however, not even messages about soft lockups. FYI, the kernel should be available in the enterprise repositories too now.
 
