BUG: soft lockup

e100

Renowned Member
Nov 6, 2010
Columbus, Ohio
ulbuilder.wordpress.com
Hello again everyone, been too long since my last post here.

I have had one server randomly locking up for over a month now, and now a second server is having the same problem.
Unfortunately I have not captured all of the kernel messages that would help diagnose this, but I have a couple of screenshots from the two servers.

Both servers were installed onto ZFS using the Proxmox ISO.
I have tried the following:
* Installed the latest Proxmox updates
* Updated the BIOS
* Disabled swap on ZFS since that is known to cause a deadlock (see the sketch below): https://github.com/zfsonlinux/zfs/issues/7734
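In case it helps others, disabling swap on a ZFS zvol looks roughly like this (a sketch; rpool/swap is the zvol the Proxmox ZFS installer creates by default, adjust the name to your pool):

Code:
# Stop swapping on the zvol (assumes the installer-default rpool/swap):
swapoff /dev/zvol/rpool/swap
# Comment out the matching /etc/fstab entry so it stays off after reboot:
sed -i 's|^/dev/zvol/rpool/swap|#&|' /etc/fstab
# Optionally reclaim the space:
zfs destroy rpool/swap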

We did have zram enabled; I removed it after the most recent lockup and am waiting to see whether that helps.

Any suggestions?

pveversion -v
Code:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.10.11-1-pve: 4.10.11-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
pve-zsync: 1.7-2
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1

Server Specs:
Supermicro X10DRI-T, two E5-2620 v4, 128GB RAM (screenshot: vm1-lockup.png)
Supermicro X9DRL-3F/i, two E5-2650 v1, 128GB RAM (screenshot: vm4.png)
 
Hi!
I have one server randomly locking up for over a month now

Seems like a kernel regression? Could you try booting from an older kernel (one installed over two months ago)?

If you could capture more kernel logs, that would be really great; with the available info it's a bit hard to pin down where this issue could be coming from...
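One way to capture kernel messages from a box that dies before anything reaches disk is netconsole, which streams them over UDP to another host. A rough sketch (the interface name, addresses, and MAC below are placeholders for your own network):

Code:
# Format: netconsole=<src-port>@<src-ip>/<interface>,<dst-port>@<dst-ip>/<dst-mac>
# All names and addresses below are placeholders.
modprobe netconsole netconsole=6665@192.168.1.10/eno1,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
# On the receiving host (192.168.1.20), record the stream, e.g. with OpenBSD netcat:
nc -u -l 6666 | tee kernel-console.log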
 
If I am not mistaken, the ZFS module was upgraded recently, and I have already run zpool upgrade.
I do not think it would be OK to boot a kernel with an older ZFS module, right?
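A quick way to see what the pool now requires versus what an older module would provide (a sketch; rpool is the installer-default pool name):

Code:
# List pools whose features are not all enabled yet:
zpool upgrade
# Per-feature state (enabled/active/disabled) on the pool:
zpool get all rpool | grep feature@
# Version of the currently loaded ZFS module:
cat /sys/module/zfs/version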

I went digging in the logs; the results are attached as text files.
All of these lockups occurred while we had ZFS swap and zram enabled, before we upgraded the BIOS.

First boot after BIOS update:
Feb 8 11:58:08 vm1 kernel: [ 0.000000] DMI: Supermicro X10DRi/X10DRI-T, BIOS 3.1 09/14/2018

After upgrading the BIOS the issue still occurs, but when it happens nothing is written to the logs.
Feb 10th: zfs swap and zram enabled
Feb 10 02:10:32 vm1 qm[17374]: <root@pam> update VM 901: -lock backup
Feb 10 02:11:14 vm1 vzdump[27762]: <root@pam> end task UPID:vm1:00006C73:00C5D66B:5C5FB22A:vzdump::root@pam: OK
--Server locked up about 5:14, errors only on console, nothing in logs
Feb 10 05:28:58 vm1 kernel: [ 0.000000] Linux version 4.15.18-10-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100) ()

Disabled zfs swap on Feb 11th, left zram enabled
Feb 15th:
Feb 15 02:10:49 vm1 qm[2320]: <root@pam> update VM 901: -lock backup
Feb 15 02:11:22 vm1 vzdump[23487]: <root@pam> end task UPID:vm1:00005BC1:02762985:5C6649A9:vzdump::root@pam: OK
-- Server locked up about 2:14, errors only on console, nothing in logs
Feb 15 02:45:37 vm1 kernel: [ 0.000000] Linux version 4.15.18-10-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100) ()

Disabled zram on Feb 15th, so the server is now running with no swap at all.

While this problem has occurred at other times, it most often happens during or within a few hours of completing a vzdump backup.
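Since the errors only ever show up on the console, mirroring the kernel console to a serial port that something else records could help here; these Supermicro boards have IPMI serial-over-LAN. A sketch (ttyS1 and 115200 are assumptions, check which UART the BMC exposes):

Code:
# In /etc/default/grub, mirror the console to serial as well:
#   GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
update-grub   # then reboot
# Record the console from another machine via IPMI SOL
# (<bmc-ip>, ADMIN, and <password> are placeholders):
ipmitool -I lanplus -H <bmc-ip> -U ADMIN -P <password> sol activate | tee console.log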
 

Attachments

  • feb8.txt (40.4 KB)
  • jan28.txt (124.2 KB)
  • jan30.txt (544.1 KB)
So it seems as if either zram alone is the issue for you, or both are (separate?) issues, as swap on ZFS is currently really not supported (as per the linked issue). You may want to create a separate partition and use that directly as swap, if possible; a rough sketch follows below. Most of the kernel logs also point at the memory subsystem at the point where it tries to get (swap in?) pages, so yes, that is surely the problem area.
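Something like this (a sketch; /dev/sdb2 is a placeholder for a real unused partition):

Code:
# Format the partition as swap and enable it:
mkswap /dev/sdb2
swapon /dev/sdb2
# Note the UUID and persist it in /etc/fstab:
blkid /dev/sdb2
echo 'UUID=<uuid-from-blkid> none swap sw 0 0' >> /etc/fstab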

How much memory is roughly available for the OS (total memory - guest memory)?
 
The server has 128GB RAM; the virtual servers combined are assigned just under 60GB.
We have zfs_arc_max set to 20GB.
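For reference, that limit is typically set via a modprobe option (a sketch; 20GB expressed in bytes):

Code:
# /etc/modprobe.d/zfs.conf -- cap the ARC at 20 GiB (20 * 1024^3 bytes):
options zfs zfs_arc_max=21474836480
# Rebuild the initramfs so the option applies at early boot:
update-initramfs -u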

We have not had any issues since turning off zram on the 15th.
It needs to run stably for at least a month before we can be confident that turning off zram fixed anything.

I am considering just running swapless moving forward.
 
OK, then this is surely not the issue.

Sounds reasonable. Looking in the kernel git, zram seems a bit unloved there: either it is simply feature-complete and really stable (a good thing), or it does not see much widespread use and thus a few bugs with certain setups still hide in it. Our Ubuntu Bionic 4.15-based kernel saw not a single zram backport (and neither did the upstream stable kernel trees of 4.14 and 4.19); further, the zram commits between 4.15 and 5.0-rc8 are not many, but none stuck out to me as an obvious fix candidate for your issue...

If you have a good amount of spare resources like you do, going swapless sounds reasonable.
 
While running swapoff on a couple of nodes, the swapoff task would hang, unable to turn off swap on the zram devices, and generated hung task messages.

I believe these systems are still running.
Could we get any diagnostic data from these systems that might help discover the source of this problem?
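In case it helps, a few generic diagnostics that can be pulled from a hung swapoff (a sketch; assumes SysRq is enabled and the commands run as root):

Code:
# Dump all blocked (D-state) tasks with kernel stacks into dmesg:
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
# Kernel stack of the hung swapoff process itself:
cat /proc/$(pgrep -x swapoff)/stack
# Current state of the zram devices:
zramctl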
 
Hi @e100
We have the same issues as in your screenshots above, on SLES and on Ubuntu; all the messages on the console look the same.
Did you fix that problem?
And sometimes the VMs freeze during backup; did you have that problem too?

best regards,
roman
 
I encountered many different kernel panics within different kinds of guest VMs with zram activated on the host, to the tune of several a day.
Deactivating zram resulted in stable guest VMs; no kernel panic has occurred since.

To be noted: activating zram _within_ the guest VM works flawlessly for me.
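For completeness, setting up zram swap inside a guest looks roughly like this (a sketch; the 4G size and lzo algorithm are arbitrary choices):

Code:
# Load the module (creates /dev/zram0 by default):
modprobe zram
# Configure compression and size, then use it as high-priority swap:
zramctl --algorithm lzo --size 4G /dev/zram0
mkswap /dev/zram0
swapon -p 100 /dev/zram0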
 
I have to amend: the crashes I encountered may have been due to another issue impacting systems with Celeron Jasper Lake (N5105) processors. The latest PVE kernels have largely improved the situation, but not completely solved it.
 
