Kernel 5.4.44 causes system freeze on HP MicroServer Gen8

I'm having the exact same issue with a Ryzen 7 1700 CPU (C-state bug fixed), ZFS and NFS.
I get a freeze within 2 to 10 hours on 5.4.44-1-pve, whereas I've now been running 5.4.41-1-pve for a week with no issues at all.

Bash:
service --status-all
[ + ] apparmor
[ - ] console-setup.sh
[ + ] cron
[ + ] dbus
[ - ] hwclock.sh
[ + ] iscsid
[ - ] keyboard-setup.sh
[ + ] kmod
[ - ] lvm2
[ + ] networking
[ - ] nfs-common
[ + ] nfs-kernel-server
[ + ] open-iscsi
[ + ] openbsd-inetd
[ + ] postfix
[ + ] procps
[ + ] rbdmap
[ + ] rpcbind
[ + ] rrdcached
[ - ] rsync
[ + ] rsyslog
[ + ] smartmontools
[ + ] ssh
[ + ] udev
[ - ] x11-common
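
For completeness, this is roughly how I check which kernel the node actually booted and which pve-kernel packages are installed (the versions shown will of course differ per node):
Bash:
# kernel the node is currently running
uname -r
# Proxmox kernel packages installed on this node
dpkg -l | grep pve-kernel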
 
Same system freeze issue on my Zotac ZBOX CI327 nano (Intel N3450 CPU). I don't use ZFS. Samba and NFS services are running.

Just got one full oops a bit ago, quite similar. I have some idea where it could come from - should know by tomorrow morning whether I was right (i.e., if a file server in the lab here is still working then).
 

In any case, thank you for your prompt attention. ;)
 
There's an updated kernel package available on pvetest: pve-kernel-5.4.44-2-pve in version 5.4.44-2 (with the meta package being pve-kernel-5.4 version 6.2-4).

It has run OK so far on the machine where we had previously seen this, so I'd say there's a decent chance it's the right fix.
More testing for such a race condition is definitely needed; we updated more machines here to that kernel and would also appreciate feedback from people in this thread.
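
If you want to test it, the rough steps are enabling the pvetest repository and installing the new kernel package; something along these lines, assuming a PVE 6.x (Debian Buster) node:
Bash:
# add the pvetest repository (PVE 6.x is based on Debian Buster)
echo "deb http://download.proxmox.com/debian/pve buster pvetest" > /etc/apt/sources.list.d/pvetest.list
# install the updated kernel and reboot into it
apt update
apt install pve-kernel-5.4.44-2-pve
reboot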
 

Unfortunately, the affected machine is a production one.
For the moment I'm staying on 5.4.41; I hope others can try it.
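
In case it helps anyone else who needs to hold a node on the older kernel for now, GRUB can be told to keep booting it; a rough sketch, assuming the node boots via GRUB (the entry name below is only an example - check your own grub.cfg):
Bash:
# list the boot entries GRUB knows about
grep -E "menuentry '" /boot/grub/grub.cfg | cut -d"'" -f2
# remember a chosen entry across reboots
sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=saved/' /etc/default/grub
grub-set-default 'Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.4.41-1-pve'
update-grub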
 

Are you sure you rebooted into the correct pve-kernel-5.4.44-2-pve, with the build date from yesterday?
Bash:
uname -a
Linux nasi 5.4.44-2-pve #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) x86_64 GNU/Linux

And how long did it take to trigger?
 

Sorry, I misspoke.
;)
I didn't try 5.4.44-2. I'm staying on 5.4.41 because it's a production machine and I can't migrate all the VMs to another node for testing.
 

OK, and I misread - thanks for clarifying :)
 
Hi,

Long-time lurker here. I have been seeing the same issue on 5.4.44-1 (reverted to 5.4.41-1 and it has been stable all week).
1/ With 5.4.44-1 it would kernel panic within 2 days
2/ The machine is running an E5-2660v3 on an Asus X99-E WS with 128GB RAM
3/ ZFS rpool on 2 SATA Samsung 840's ([none] mq-deadline)
4/ ZFS pool on 10x 10TB WD REDs ([mq-deadline] none)
5/ one NVMe (ADATA SX8200PNP) running ext4 ([none] mq-deadline)
6/ Samba and nfs-kernel-server running on the base system
7/ one Ubuntu container and one Windows VM
I initially thought it was because I was messing around with the Windows VM, changing the GPU and CPU settings after I had upgraded the kernel.
After destroying the old Windows VM and rebuilding it, I still had panics.

One of the panics ended with RIP: cgroup_bpf_run_filter_skb; I can dig up the photograph if needed.

The odd thing is, I have the same CPU on a different motherboard (an old PowerEdge) which hasn't seen this issue. (That one doesn't run Samba.)
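
For reference, the bracketed scheduler names above are simply what the kernel reports in sysfs; something like this prints them for every block device, with the active scheduler shown in brackets:
Bash:
# the scheduler in [brackets] is the active one, one line per block device
grep . /sys/block/*/queue/scheduler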
 

You can try kernel 5.4.44-2, which has been stable for me since yesterday.

 
FYI: Due to no issues here for over 40 hours, the positive feedback we got in this thread, and the fix having been picked up on the upstream kernel mailing list, we moved this kernel update to pve-no-subscription.


Some more in-depth information on the underlying issue, for those interested: the actual bug was introduced much earlier than this update. To our current understanding it goes back to the 4.5 kernel release, with commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup"), which added a per-socket cgroup information data structure but did not handle reference counting of that shared structure when cloning a socket. Until recently, however, the probability of hitting this was so low that in practice it never triggered.
With a recent cgroup memory leak fix (upstream commit 090e28b229af), which was included in the update from v5.4.41 to v5.4.44, this began to matter and the bug could now actually trigger.
It is not actually related directly to Samba or NFS, but having such a file server active greatly increases the chance of this happening - on hosts with systemd those services run in cgroups and generate quite a bit of network activity when in use.
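
If you want to see why an active file server makes a good trigger, you can check that these daemons really do run inside their own cgroups on a systemd host; for example (smbd is only used as an example here, unit and process names may differ on your setup):
Bash:
# cgroup the Samba daemon lives in (any systemd-managed network service looks similar)
cat /proc/$(pidof -s smbd)/cgroup
# the CGroup: line of the unit status shows the same hierarchy
systemctl status smbd --no-pager | grep -i CGroup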
 
Thank you for explaining the issue. I've rebooted into the new kernel on both my systems and will report back if the panic reappears.
 
