BUG: soft lockup

Discussion in 'Proxmox VE: Installation and configuration' started by e100, Feb 15, 2019.

  1. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Hello again everyone, it has been too long since my last post here.

    I have had one server randomly locking up for over a month now, and now a second server is also having this problem.
    Unfortunately I have not captured all of the kernel messages that would help diagnose this, but I have a couple of screenshots from the two servers.

    Both servers were installed onto ZFS using the Proxmox ISO.
    I have tried the following:
    * Installed the latest Proxmox updates
    * Updated the BIOS
    * Disabled ZFS disk swap, since that is known to cause a deadlock: https://github.com/zfsonlinux/zfs/issues/7734

    We did have zram enabled; I removed it after the most recent lockup and am waiting to see whether that helps.
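
    For reference, disabling the zvol swap amounted to roughly this (a sketch only; rpool/swap is the installer-default name and may differ, check /etc/fstab and zfs list for the actual dataset):
    Code:
    swapoff /dev/zvol/rpool/swap                  # stop swapping to the zvol
    sed -i '/zvol\/rpool\/swap/d' /etc/fstab      # keep it off across reboots
    zfs destroy rpool/swap                        # optional: reclaim the space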

    Any suggestions?

    pveversion -v
    Code:
    proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
    pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
    pve-kernel-4.15: 5.3-1
    pve-kernel-4.13: 5.2-2
    pve-kernel-4.15.18-10-pve: 4.15.18-32
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    pve-kernel-4.15.17-3-pve: 4.15.17-14
    pve-kernel-4.13.16-4-pve: 4.13.16-51
    pve-kernel-4.13.16-3-pve: 4.13.16-50
    pve-kernel-4.13.16-2-pve: 4.13.16-48
    pve-kernel-4.13.13-4-pve: 4.13.13-35
    pve-kernel-4.13.13-1-pve: 4.13.13-31
    pve-kernel-4.13.8-3-pve: 4.13.8-30
    pve-kernel-4.13.8-2-pve: 4.13.8-28
    pve-kernel-4.10.15-1-pve: 4.10.15-15
    pve-kernel-4.10.11-1-pve: 4.10.11-9
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-3
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-43
    libpve-guest-common-perl: 2.0-19
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-36
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-2
    lxcfs: 3.0.2-2
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-22
    pve-cluster: 5.0-33
    pve-container: 2.0-33
    pve-docs: 5.3-1
    pve-edk2-firmware: 1.20181023-1
    pve-firewall: 3.0-17
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-6
    pve-i18n: 1.0-9
    pve-libspice-server1: 0.14.1-1
    pve-qemu-kvm: 2.12.1-1
    pve-xtermjs: 3.10.1-1
    pve-zsync: 1.7-2
    qemu-server: 5.0-45
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.12-pve1~bpo1
    
    Server Specs:
    Supermicro X10DRI-T, two E5-2620 v4, 128GB RAM (screenshot: vm1-lockup.png)
    Supermicro X9DRL-3F/i, two E5-2650 v1, 128GB RAM (screenshot: vm4.png)
     
  2. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,136
    Likes Received:
    147
    Hi!
    This looks like it could be a kernel regression. Could you try booting from an older kernel, i.e. one that was installed two or more months ago?
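
    For reference, an older kernel can be picked from the GRUB "Advanced options" sub-menu at boot, or pinned roughly like this (the entry names below are only an illustration, take the exact strings from your own grub.cfg):
    Code:
    # list installed kernels and the matching GRUB menu entries
    dpkg -l 'pve-kernel-*' | grep '^ii'
    grep -E "^\s*(submenu|menuentry) " /boot/grub/grub.cfg | cut -d"'" -f2

    # then set GRUB_DEFAULT in /etc/default/grub to "<submenu title>><entry title>",
    # e.g. (illustrative only, copy your real titles from the output above):
    # GRUB_DEFAULT="Advanced options for ...>..., with Linux 4.15.18-9-pve"
    update-grub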

    If you could capture more kernel logs it would be really great; with the available info it is a bit hard to tell which direction this issue is coming from...
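
    Since nothing reaches the disk once the box locks up, streaming the kernel log to another machine with netconsole tends to work better than local files. A rough sketch (IPs, port, NIC name and MAC are placeholders for your own values):
    Code:
    # on the affected node: raise the console log level and load netconsole
    dmesg -n 8
    modprobe netconsole netconsole=6665@192.168.1.21/eno1,6666@192.168.1.10/aa:bb:cc:dd:ee:ff
    #                              local port@local IP/NIC , remote port@remote IP/remote MAC

    # on the receiving machine (netcat syntax varies between flavours):
    nc -u -l 6666 | tee vm1-netconsole.log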
     
  3. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    If I am not mistaken, the zfs module was upgraded recently and I have already run zpool upgrade.
    I do not think it would be OK to boot a kernel with an older zfs module after that, right?
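
    To double-check which features the pool is actually using after the upgrade, I could run something like this (rpool being the default pool name from the installer):
    Code:
    # features marked "active" are the ones an older zfs module would have to
    # support in order to import the pool again
    zpool get all rpool | grep feature@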

    I went digging in the logs; the relevant excerpts are attached as text files.
    All of these occurred while we still had zfs swap and zram enabled, before we upgraded the BIOS.

    First boot after BIOS update:
    Feb 8 11:58:08 vm1 kernel: [ 0.000000] DMI: Supermicro X10DRi/X10DRI-T, BIOS 3.1 09/14/2018

    After upgrading the BIOS the issue still occurs, but when it happens nothing is written to the logs.
    Feb 10th: zfs swap and zram enabled
    Feb 10 02:10:32 vm1 qm[17374]: <root@pam> update VM 901: -lock backup
    Feb 10 02:11:14 vm1 vzdump[27762]: <root@pam> end task UPID:vm1:00006C73:00C5D66B:5C5FB22A:vzdump::root@pam: OK
    -- Server locked up about 5:14, errors only on console, nothing in logs
    Feb 10 05:28:58 vm1 kernel: [ 0.000000] Linux version 4.15.18-10-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100) ()

    Disabled zfs swap on Feb 11th, left zram enabled
    Feb 15th:
    Feb 15 02:10:49 vm1 qm[2320]: <root@pam> update VM 901: -lock backup
    Feb 15 02:11:22 vm1 vzdump[23487]: <root@pam> end task UPID:vm1:00005BC1:02762985:5C6649A9:vzdump::root@pam: OK
    -- Server locked up about 2:14, errors only on console, nothing in logs
    Feb 15 02:45:37 vm1 kernel: [ 0.000000] Linux version 4.15.18-10-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100) ()

    Disabled zram on Feb 15th so server is now running with no swap.

    While this problem has occurred at other times, it most often happens during, or within a few hours of, a vzdump backup.
     

    Attached Files:

  4. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,136
    Likes Received:
    147
    So it seems a bit like either zram alone is the issue in your case, or both are (separate?) issues, since swap on ZFS is currently really not considered stable (as per the linked issue). If possible, you may want to create a separate partition and use that directly as swap. Most of the kernel logs also point at the memory subsystem, at the point where it tries to get (swap in?) pages, so yes, that is surely the problem area.
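
    If you go that route, it is roughly the following (the device name is just a placeholder, use a partition you have actually set aside for this):
    Code:
    mkswap /dev/sdX9                 # format the spare partition as swap
    swapon /dev/sdX9                 # enable it right away
    blkid /dev/sdX9                  # note the UUID, then make it permanent:
    echo 'UUID=<uuid-from-blkid> none swap sw 0 0' >> /etc/fstab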

    How much memory is roughly available for the OS (total memory - guest memory)?
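
    A rough way to add that up on the node itself (standard PVE config paths; this counts configured maximum memory, not what the guests currently use):
    Code:
    grep -h '^memory:' /etc/pve/qemu-server/*.conf /etc/pve/lxc/*.conf 2>/dev/null \
        | awk '{sum += $2} END {print sum " MiB assigned to guests"}'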
     
  5. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    The server has 128GB RAM; the virtual servers combined are assigned just under 60GB.
    We have zfs_arc_max set to 20GB
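
    For completeness, the limit is set the usual way through the module options, roughly like this:
    Code:
    # /etc/modprobe.d/zfs.conf -- 20 GiB = 20 * 1024^3 bytes
    options zfs zfs_arc_max=21474836480

    # picked up at next boot; with root on ZFS also refresh the initramfs:
    update-initramfs -u
    # or apply immediately without a reboot:
    echo 21474836480 > /sys/module/zfs/parameters/zfs_arc_max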

    We have not had any issues since turning off zram on the 15th.
    It needs to run stable for at least a month to have confidence that turning off zram fixed anything.

    I am considering just running swapless moving forward.
     
  6. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,136
    Likes Received:
    147
    OK, then memory pressure is surely not an issue here.

    Sounds reasonable. Looking at the kernel git, zram seems a bit unloved there: either it is simply feature-complete and really stable (a good thing), or it does not see too much widespread use and thus a few bugs with certain setups still hide in it. Our Ubuntu Bionic 4.15-based kernel saw not a single zram backport (nor did the upstream stable kernel trees of 4.14 and 4.19), and the commits between 4.15 and 5.0-rc8 are not many either, but none of them stuck out to me as an obvious fix candidate for your issue...

    If you have a good amount of spare resources, as you do, running swapless sounds reasonable.
     
  7. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    We have not had any issues since turning off zRam over a month ago.
     
  8. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,136
    Likes Received:
    147
  9. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    While running swapoff on a couple of nodes, the swapoff task would hang and was unable to turn swap off on the zram devices. It would sit there indefinitely, generating hung-task messages in the kernel log.

    I believe these systems are still running.
    Could we get any diagnostic data from these systems that might help discover the source of this problem?
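
    For example, something along these lines should still be possible on the stuck nodes (assuming sysrq is permitted there and swapoff is the only hung process of that name):
    Code:
    # dump the stacks of all blocked (D-state) tasks into the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 300

    # kernel stack of the stuck swapoff itself
    cat /proc/$(pgrep -x swapoff)/stack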
     