Upgraded to latest v6, high CPU load

robetus

Well-Known Member
Jul 8, 2017
54
1
48
Las Vegas, NV
I've spent a couple of hours looking through all the threads about CPU load after an upgrade but haven't found a solution. Here are my server's stats after the upgrade:
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.140-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-6
pve-kernel-helper: 6.4-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-4.15: 5.4-14
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-1
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1

Proxmox was running great until the upgrade. I read about someone fixing it with a patch, but I can't find anything really relevant. My host has 24 cores across 2 sockets, and they are nearly maxed out running only two KVM containers. Any suggestions would be appreciated.
 
Hi,

What CPU model are you using? lscpu
Are there any long-running processes that consume lots of CPU time? top -b -c -n 1 -o TIME -w200 | head -15
What's the system's pressure stall info? head -n -0 /proc/pressure/*

You could try booting the previous kernel manually on reboot to check if this issue came with the latest kernel version.
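For reference, one way to see which kernel is currently running and which older pve-kernel packages are still installed (generic Debian commands, so any of the versions from your package list above should still be selectable at boot):
Code:
uname -r                             # currently running kernel
dpkg -l 'pve-kernel-*' | grep '^ii'  # installed Proxmox kernel packages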
 
lscpu:
Code:
@proxmox:~# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              24
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Stepping:            4
CPU MHz:             2893.275
CPU max MHz:         3100.0000
CPU min MHz:         1200.0000
BogoMIPS:            5187.93
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            15360K
NUMA node0 CPU(s):   0-23
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts flush_l1d

Yes, there are processes running with: top -b -c -n 1 -o TIME -w200 | head -15

Code:
@proxmox:~# head -n -0 /proc/pressure/*
==> /proc/pressure/cpu <==
some avg10=4.64 avg60=8.87 avg300=7.48 total=4171712515

==> /proc/pressure/io <==
some avg10=0.43 avg60=0.95 avg300=0.31 total=92437322
full avg10=0.02 avg60=0.05 avg300=0.01 total=18857819

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
 
Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
hmm, a very old CPU... maybe an (old) HW-specific regression

==> /proc/pressure/cpu <== some avg10=4.64 avg60=8.87 avg300=7.48 total=4171712515
CPU is definitely bogged down: the "some" line means that, over the last 10 seconds, at least one runnable task was stalled waiting for CPU about 4.6% of the time, and the 60- and 300-second averages are even higher.
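If you want to watch the pressure live while the load is high, something like this works (plain watch from procps, nothing Proxmox-specific):
Code:
watch -n 1 cat /proc/pressure/cpu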

Yes, there are processes running with: top -b -c -n 1 -o TIME -w200 | head -15
Yeah, it's expected that there are some processes running on a running system, but can you please post the output then? Otherwise we don't know what's producing the actual load...
 
Yeah, it is old, but it was running perfectly before the upgrade.

Code:
@proxmox:~# top -b -c -n 1 -o TIME -w200 | head -15
top - 23:14:00 up  7:35,  1 user,  load average: 20.02, 18.97, 18.75
Tasks: 707 total,   2 running, 697 sleeping,   8 stopped,   0 zombie
%Cpu(s): 55.0 us,  8.4 sy,  0.0 ni, 36.2 id,  0.1 wa,  0.0 hi,  0.3 si,  0.0 st
MiB Mem : 257656.4 total, 229461.4 free,  27891.8 used,    303.1 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 227993.2 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
119662 root      20   0   67.1g   6.6g   7336 S  1356   2.6 733:25.04 /usr/bin/kvm -id 110 -name srv1 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/110.qmp,server,nowait -mon charde+
 11667 root      20   0 5414020   1.3g   6836 S  36.6   0.5 340:25.70 /usr/bin/kvm -id 119 -name srv2 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/119.qmp,server,nowait -mon cha+
 15538 root      20   0 5406828   1.2g   6848 S  29.3   0.5 222:54.15 /usr/bin/kvm -id 122 -name srv3 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/122.qmp,server,nowait -+
104332 root      20   0 9610116   5.2g   7184 S  68.3   2.1 118:32.81 /usr/bin/kvm -id 129 -name srv4 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/129.qmp,server,nowait -m+
  4794 root      20   0  305364  90048   8264 S   0.0   0.0  72:21.71 pvestatd
  4792 root      20   0  306708  88892   5768 S   0.0   0.0  50:36.49 pve-firewall
   424 root       0 -20       0      0      0 S   4.9   0.0  21:20.64 [zvol]
     2 root      20   0       0      0      0 S   0.0   0.0  16:17.25 [kthreadd]

I see that srv1 is %CPU 1356, which is crazy I guess, but I don't see how that's possible.
 
I see that srv1 is %CPU 1356, which is crazy I guess, but I don't see how that's possible.
In top, 100% means 100% of one core, so, for example, 200% could be two cores maxed out completely or four cores each at 50% load. The 1356% here therefore corresponds to roughly 13.5 of your 24 logical cores being kept busy by that VM.

OK, so four VMs are running (FYI, there's no such thing as a KVM container); none of them is using that much memory, but quite some CPU time is being spent. What's running in the VMs (OS, workload)? And can you please post such a VM's config: qm config 110

In theory, an update inside a VM could also have caused this, coinciding with the last reboot (full restart of the VMs); not really likely, but not impossible either.

As mentioned, if you find the time I'd definitely recommend trying out an older kernel; if that alone helps, we can already rule out quite a few components.
 
Code:
@proxmox:~# qm config 110
balloon: 8192
bootdisk: scsi0
cores: 8
description:
ide2: none,media=cdrom
memory: 65536
name: srv1
net0: virtio=56:AF:E8:AA:CB:0C,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-110-disk-1,size=400G
scsihw: virtio-scsi-pci
smbios1: uuid=9a1b7e32-46b6-4fc0-8d40-aee3c7698920
sockets: 2

I'll look into trying an older kernel. Do you have a link to the best instructions for booting into an older kernel with Proxmox? I read some things on the forum that will probably work.
 
I found this solution for setting the default kernel with grub: https://forum.proxmox.com/threads/select-default-boot-kernel.79582/

These are my kernel options:
Code:
submenu 'Advanced options for Proxmox Virtual Environment GNU/Linux' $menuentry_id_option 'gnulinux-advanced-cudgf983475gfhdjk' {
    menuentry 'Proxmox Virtual Environment GNU/Linux, with Linux 5.4.140-1-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.4.140-1-pve-advanced-cudgf983475gfhdjk' {
    menuentry 'Proxmox Virtual Environment GNU/Linux, with Linux 5.4.34-1-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.4.34-1-pve-advanced-cudgf983475gfhdjk' {
    menuentry 'Proxmox Virtual Environment GNU/Linux, with Linux 5.3.18-3-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.3.18-3-pve-advanced-cudgf983475gfhdjk' {
    menuentry 'Proxmox Virtual Environment GNU/Linux, with Linux 4.15.18-26-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.15.18-26-pve-advanced-cudgf983475gfhdjk' {
    menuentry 'Proxmox Virtual Environment GNU/Linux, with Linux 4.13.16-4-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.13.16-4-pve-advanced-cudgf983475gfhdjk' {
    menuentry 'Proxmox Virtual Environment GNU/Linux, with Linux 4.13.13-5-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.13.13-5-pve-advanced-cudgf983475gfhdjk' {
    menuentry 'Proxmox Virtual Environment GNU/Linux, with Linux 4.13.13-2-pve' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.13.13-2-pve-advanced-cudgf983475gfhdjk' {

Which version should I try to boot into? I was thinking 5.3.18-3-pve

I also want to add that my nginx reverse proxy stopped working after the upgrade. I am able to access the GUI with the IP address but not with the domain set in nginx.

What do you think about me upgrading my CPUs? The board in this server is nowhere near maxed out, and I could put in something like two E5-2680 v2s, which would be a huge upgrade from the two E5-2630 v2s I have now. Everything else is working, but it seems the CPUs just can't take the load.
 
Which version should I try to boot into? I was thinking 5.3.18-3-pve

I'd rather try 5.4.34-1-pve first; it was released quite some time ago and thus was probably the last kernel that worked OK for you (assuming the regression is the kernel's fault in the first place):

menuentry 'Proxmox Virtual Environment GNU/Linux, with Linux 5.4.34-1-pve'
I also want to add that my nginx reverse proxy stopped working after the upgrade. I am able to access the GUI with the IP address but not with the domain set in nginx.
That is rather weird; do you see any errors in the syslog/journal? Are the disks all healthy, or are there IO errors logged in the kernel log (check dmesg)?
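A few generic starting points (the smartctl part assumes smartmontools, which is in your package list; replace sdX with an actual device):
Code:
journalctl -b -p err               # errors logged since the current boot
dmesg -T | grep -iE 'error|fail'   # kernel log with readable timestamps
smartctl -a /dev/sdX               # SMART health of one disk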
 
