BUG: soft lockup

e100

Hello again everyone, it's been too long since my last post here.

One of my servers has been randomly locking up for over a month, and now a second server is having the same problem.
Unfortunately I have not captured all of the kernel messages that would help diagnose this, but I do have a couple of screenshots from the two servers.

Both servers were installed onto ZFS using the Proxmox ISO.
I have tried the following:
* Installed the latest Proxmox updates
* Updated the BIOS
* Disabled ZFS disk swap, since that is known to cause a deadlock: https://github.com/zfsonlinux/zfs/issues/7734 (see the commands below)
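For reference, disabling the ZFS swap amounted to roughly the following; rpool/swap is the zvol the Proxmox installer creates by default, so adjust the name if your pool layout differs:
Code:
# Stop swapping to the zvol immediately
swapoff /dev/zvol/rpool/swap
# Remove the matching line from /etc/fstab so it stays off after reboot
sed -i '\|/dev/zvol/rpool/swap|d' /etc/fstab
# Optionally destroy the zvol to reclaim the space
zfs destroy rpool/swap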

We did have zram enabled; I removed it after the most recent lockup and am waiting to see whether that helps.
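Removing the zram swap was along these lines (if zram was set up by a package such as zram-config, removing that package is the cleaner route):
Code:
# Stop swapping to the zram device
swapoff /dev/zram0
# Reset the device to release its compressed memory
echo 1 > /sys/block/zram0/reset
# Unload the module so the device cannot come back by accident
modprobe -r zram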

Any suggestions?

pveversion -v
Code:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.10.11-1-pve: 4.10.11-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
pve-zsync: 1.7-2
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
Server Specs:
Supermicro X10DRI-T, two E5-2620 v4, 128GB RAM (screenshot: vm1-lockup.png)
Supermicro X9DRL-3F/i, two E5-2650 v1, 128GB RAM (screenshot: vm4.png)
 

t.lamprecht

Proxmox Staff Member
Hi!

> I have one server randomly locking up for over a month now

This seems like it could be a kernel regression. Could you try booting an older kernel (one installed over two months ago)?
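A sketch of how to do that on a PVE 5.x / GRUB2 system; the exact menu entry title is an assumption, check /boot/grub/grub.cfg for the real one:
Code:
# List the installed kernel packages and pick an older one
dpkg -l 'pve-kernel-*' | grep ^ii
# Boot a specific older kernel on the next reboot only
# (requires GRUB_DEFAULT=saved in /etc/default/grub)
grub-reboot 'Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.13.16-4-pve'
reboot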

If you could capture more kernel logs, that would be really great; with the available information it's a bit hard to tell where this issue could be coming from...
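If the machine dies without writing anything to disk, netconsole can stream kernel messages over UDP to another box; the addresses, interface, and MAC below are placeholders:
Code:
# On the crashing host: send kernel messages to 192.168.1.20:6666 via eno1
modprobe netconsole netconsole=6665@192.168.1.10/eno1,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
# Raise the console loglevel so low-priority messages are sent too
dmesg -n 8
# On the receiving host (OpenBSD netcat syntax): log everything that arrives
nc -u -l 6666 | tee netconsole.log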
 

e100

If I am not mistaken, the ZFS module was upgraded recently, and I have already run zpool upgrade.
I don't think it would be safe to boot a kernel with an older ZFS module, right?
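For what it's worth, both sides of that question can be inspected directly; a pool only fails to import on an older module if features that module doesn't know are already active:
Code:
# Version of the ZFS module the running kernel loaded
modinfo zfs | grep ^version
# Feature flags on the pool; 'active' features are the ones that can
# block an import by an older module, merely 'enabled' ones usually do not
zpool get all rpool | grep feature@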

I went digging in the logs; the results are attached as text files.
All of these lockups occurred while we had ZFS swap and zram enabled, before we upgraded the BIOS.

First boot after BIOS update:
Feb 8 11:58:08 vm1 kernel: [ 0.000000] DMI: Supermicro X10DRi/X10DRI-T, BIOS 3.1 09/14/2018

After upgrading the BIOS the issue still occurs, but when it happens nothing is written to the logs.
Feb 10th: zfs swap and zram enabled
Feb 10 02:10:32 vm1 qm[17374]: <root@pam> update VM 901: -lock backup
Feb 10 02:11:14 vm1 vzdump[27762]: <root@pam> end task UPID:vm1:00006C73:00C5D66B:5C5FB22A:vzdump::root@pam: OK
-- Server locked up about 5:14, errors only on console, nothing in logs
Feb 10 05:28:58 vm1 kernel: [ 0.000000] Linux version 4.15.18-10-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100) ()

Disabled ZFS swap on Feb 11th, left zram enabled.
Feb 15th:
Feb 15 02:10:49 vm1 qm[2320]: <root@pam> update VM 901: -lock backup
Feb 15 02:11:22 vm1 vzdump[23487]: <root@pam> end task UPID:vm1:00005BC1:02762985:5C6649A9:vzdump::root@pam: OK
-- Server locked up about 2:14, errors only on console, nothing in logs
Feb 15 02:45:37 vm1 kernel: [ 0.000000] Linux version 4.15.18-10-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100) ()

Disabled zram on Feb 15th, so the server is now running with no swap.

While this problem has occurred at other times, it most often happens during, or within a few hours of, a vzdump backup.
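Since the lockups track the backups, I am also looking at throttling vzdump so it puts less pressure on the disks; both settings below are standard /etc/vzdump.conf options, the values are just examples:
Code:
# /etc/vzdump.conf
# Limit backup read bandwidth to ~100 MB/s (value is in KB/s)
bwlimit: 100000
# Backup I/O priority (0 = highest, 7 = lowest best-effort)
ionice: 7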
 

Attachments

t.lamprecht

Proxmox Staff Member
So it seems like either zram alone is the issue for you, or both are (separate?) issues, as swap on ZFS is currently really not recommended (per the linked issue). You may want to create a separate partition and use that directly as swap, if possible. Most of the kernel logs also point at the memory subsystem at the point where it tries to get (swap in?) pages, so that is surely the problem area.
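A minimal sketch of the partition approach, with /dev/sdz3 as a placeholder device name:
Code:
# Format the spare partition as swap and enable it
mkswap /dev/sdz3
swapon /dev/sdz3
# Persist it across reboots via its UUID (shown by blkid)
echo 'UUID=<uuid-from-blkid> none swap sw 0 0' >> /etc/fstab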

How much memory is roughly available for the OS (total memory - guest memory)?
 

e100

The server has 128GB of RAM; the virtual servers combined are assigned just under 60GB.
We have zfs_arc_max set to 20GB.
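For reference, that cap is set the usual way via the module parameter (20GB = 21474836480 bytes):
Code:
# Apply at runtime
echo 21474836480 > /sys/module/zfs/parameters/zfs_arc_max
# Persist across reboots
echo 'options zfs zfs_arc_max=21474836480' > /etc/modprobe.d/zfs.conf
update-initramfs -u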

We have not had any issues since turning off zram on the 15th.
It needs to run stably for at least a month before we can be confident that turning off zram fixed anything.

I am considering just running swapless moving forward.
 

t.lamprecht

Proxmox Staff Member
OK, then memory pressure is surely not the issue.

Sounds reasonable. Looking at the kernel git history, it seems that zram is a bit unloved there: either it is simply feature-complete and really stable (a good thing), or it does not see much widespread use, and thus a few bugs with certain setups still hide in it. Our Ubuntu Bionic 4.15-based kernel saw not a single backport (and neither did the upstream stable kernel trees of 4.14 and 4.19); further, the commits between 4.15 and 5.0-rc8 are not many, but none stuck out to me as an obvious fix candidate for your issue...
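If anyone wants to repeat that check, the zram driver lives in a single directory of the kernel tree, so listing the relevant commits is a one-liner:
Code:
# In a clone of the mainline kernel tree
git log --oneline v4.15..v5.0-rc8 -- drivers/block/zram/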

If you have a good amount of spare resources, as you do, running swapless sounds reasonable.
 

e100

While running swapoff on a couple of nodes, the swapoff task would hang, unable to turn off swap on the zram devices; the kernel kept generating hung-task messages.

I believe these systems are still running.
Could we get any diagnostic data from these systems that might help discover the source of this problem?
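If it helps, blocked-task backtraces can be pulled via the magic sysrq interface while the hang is in progress, e.g.:
Code:
# Enable the sysrq interface if it is not already
echo 1 > /proc/sys/kernel/sysrq
# Dump stack traces of all uninterruptible (blocked) tasks into dmesg
echo w > /proc/sysrq-trigger
# Look at the kernel stack of the hung swapoff directly
cat /proc/$(pgrep -xo swapoff)/stack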
 
roman
Hi @e100
We have the same issues as in your screenshots above, on SLES and on Ubuntu; the console messages are virtually identical.
Did you ever fix the problem?
Also, sometimes the VMs freeze during backup; did you have that problem too?

best regards,
roman
 
