node crashing due to soft lockup error - CPU stuck for seconds

mocanub

Hi everyone,

I have a problem with a rather new build where the Proxmox node randomly crashes due to kernel panics with soft lockup errors (see attached logs).

CPU: Intel(R) Core(TM) i9-13900K
MOBO: Gigabyte Z790 UD
RAM: 128 GiB of DDR5 memory
STORAGE:
- ZFS in RAID1 for OS based on 2x 256 GiB Samsung NVMe SSDs
- ZFS in RAID1 for VM storage on 2x 4 TiB Samsung EVO 970
- ZFS in RAID1 for VM BACKUP storage on 2x 4 TiB Samsung EVO 970
NICs: dual-port 10 Gbps Intel card (X520-DA2)

I've also capped the ZFS ARC (cache) at 11 GiB by adding this line to `/etc/modprobe.d/zfs.conf`:
options zfs zfs_arc_max=11811160064
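For anyone who wants a different limit: `zfs_arc_max` takes a value in bytes, so the number above is simply 11 GiB written out. A minimal sketch of the conversion follows; the helper names `arc_max_bytes` and `modprobe_line` are made up for illustration and are not part of any ZFS tooling.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def arc_max_bytes(gib: int) -> int:
    """Convert an ARC size limit in GiB to the byte value zfs_arc_max expects."""
    return gib * GIB


def modprobe_line(gib: int) -> str:
    """Render the options line as it would appear in /etc/modprobe.d/zfs.conf."""
    return f"options zfs zfs_arc_max={arc_max_bytes(gib)}"


if __name__ == "__main__":
    # 11 GiB, the value used in this post
    print(modprobe_line(11))  # options zfs zfs_arc_max=11811160064
```

As far as I know, on a system that boots from ZFS (like this one) the initramfs has to be refreshed with `update-initramfs -u` and the node rebooted before a changed value is picked up.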

Here is the output of `pveversion -v`:
Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

Any idea how I can prevent these system crashes from happening?

Thanks in advance,
Bogdan M.
 

Please update your system to the latest version and post the output of `ps auxwf` so we can see which process is hanging.
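If the `ps auxwf` output is long, one rough way to narrow it down is to look for processes in uninterruptible sleep (a `STAT` value starting with `D`). The sketch below assumes the standard `ps aux` column layout; the function name `blocked_processes` and the sample rows are invented for illustration and are not taken from the affected node. It is only a heuristic, since a soft lockup happens in the kernel and may not show up as a `D`-state process at all.

```python
def blocked_processes(ps_output: str) -> list[tuple[int, str, str]]:
    """Return (pid, stat, command) for rows whose STAT column starts with 'D'."""
    rows = []
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or a row that does not match the layout
        pid, stat, command = int(fields[1]), fields[7], fields[10]
        if stat.startswith("D"):
            rows.append((pid, stat, command))
    return rows


# Hypothetical sample, NOT real output from this thread.
SAMPLE = """\
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 168000 12000 ?        Ss   10:00   0:01 /sbin/init
root         900  0.0  0.0      0     0 ?        D    10:00   0:07 [txg_sync]
root        1234 99.0  1.0 900000 80000 ?        Sl   10:01   5:00 /usr/bin/kvm -id 100
"""

if __name__ == "__main__":
    for pid, stat, command in blocked_processes(SAMPLE):
        print(pid, stat, command)
```

To use it on a real node, save the output first (for example `ps auxwf > ps.txt`) and pass the file contents to the function.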
 
Hi,

I've noticed that with the kvm64 CPU type (the default in PVE v7), CPU usage on some VMs on the affected node climbed as high as 215%.


I knew that in PVE v8 the default CPU type is x86-64-v2-AES, so I switched the VMs to that CPU type. Since then, CPU usage no longer exceeds 100% and the crashes due to kernel panics have gone away.
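For reference, the CPU type can also be changed from the node's shell with `qm set <vmid> --cpu <type>` instead of the web UI. Below is a small sketch that only builds that command; the helper name `set_cpu_type_cmd` and the VMID 100 are placeholders I picked for the example, and nothing is executed unless `dry_run` is turned off.

```python
import subprocess


def set_cpu_type_cmd(vmid: int, cpu_type: str = "x86-64-v2-AES") -> list[str]:
    """Build the argument list for `qm set <vmid> --cpu <cpu_type>`."""
    return ["qm", "set", str(vmid), "--cpu", cpu_type]


def apply_cpu_type(vmid: int, cpu_type: str = "x86-64-v2-AES", dry_run: bool = True) -> list[str]:
    """Print the command; run it only when dry_run is False (needs a Proxmox node)."""
    cmd = set_cpu_type_cmd(vmid, cpu_type)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # 100 is a placeholder VMID, not one from this thread
    apply_cpu_type(100)  # prints: qm set 100 --cpu x86-64-v2-AES
```

As far as I understand, the new CPU type only applies after the VM has been fully stopped and started again; a reboot from inside the guest is not enough.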

I don't know whether this is specific to this physical CPU, but I'm posting the update here in the hope that it helps others who run into the same issue.

Regards,
Bogdan M.
 