Kernel 5.4.44 causes system freeze on HP MicroServer Gen8

I'm having the exact same issue with a Ryzen 7 1700 CPU (C-state bug fixed), ZFS and NFS.
I get a freeze within 2 to 10 hours on 5.4.44-1-pve, whereas I've now been running 5.4.41-1-pve for a week with no issues at all.

Bash:
service --status-all
[ + ] apparmor
[ - ] console-setup.sh
[ + ] cron
[ + ] dbus
[ - ] hwclock.sh
[ + ] iscsid
[ - ] keyboard-setup.sh
[ + ] kmod
[ - ] lvm2
[ + ] networking
[ - ] nfs-common
[ + ] nfs-kernel-server
[ + ] open-iscsi
[ + ] openbsd-inetd
[ + ] postfix
[ + ] procps
[ + ] rbdmap
[ + ] rpcbind
[ + ] rrdcached
[ - ] rsync
[ + ] rsyslog
[ + ] smartmontools
[ + ] ssh
[ + ] udev
[ - ] x11-common
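
For completeness, this is roughly how I check which kernel the node actually booted and which pve-kernel packages are installed (the versions shown will of course differ per node):
Bash:
# kernel the node is currently running
uname -r
# Proxmox kernel packages installed on this node
dpkg -l | grep pve-kernel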
 
Same system freeze issue on my Zotac ZBOX CI327 nano (Intel N3450 CPU). I don't use ZFS. Samba and NFS services are running.

Just got one full oops a bit ago, quite similar. I have some idea where it could come from - should know by tomorrow morning whether I was right (i.e., if a file server in the lab here is still working then).
 

In any case, thank you for your prompt attention. ;)
 
There's an updated kernel package available on pvetest: pve-kernel-5.4.44-2-pve in version 5.4.44-2 (with the meta package being pve-kernel-5.4 version 6.2-4).

It has run OK so far on the machine where we had previously seen this, so I'd say there's a decent chance it's the right fix.
More testing for such a race condition is definitely needed; we updated more machines here to that kernel and would also appreciate feedback from people in this thread.
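
If you want to test it, the rough steps are enabling the pvetest repository and installing the new kernel package; something along these lines, assuming a PVE 6.x (Debian Buster) node:
Bash:
# add the pvetest repository (PVE 6.x is based on Debian Buster)
echo "deb http://download.proxmox.com/debian/pve buster pvetest" > /etc/apt/sources.list.d/pvetest.list
# install the updated kernel and reboot into it
apt update
apt install pve-kernel-5.4.44-2-pve
reboot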
 

Unfortunately, the affected machine is a production one.
For the moment I'm staying on 5.4.41; I hope others can try it.
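
In case it helps anyone else who needs to hold a node on the older kernel for now, GRUB can be told to keep booting it; a rough sketch, assuming the node boots via GRUB (the entry name below is only an example - check your own grub.cfg):
Bash:
# list the boot entries GRUB knows about
grep -E "menuentry '" /boot/grub/grub.cfg | cut -d"'" -f2
# remember a chosen entry across reboots
sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=saved/' /etc/default/grub
grub-set-default 'Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.4.41-1-pve'
update-grub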
 

Are you sure you rebooted into the correct pve-kernel-5.4.44-2-pve, with the build date from yesterday?
Bash:
uname -a
Linux nasi 5.4.44-2-pve #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) x86_64 GNU/Linux

And how long did it take to trigger?
 

Sorry, I misspoke.
;)
I didn't try 5.4.44-2. I'm staying on 5.4.41 because it's a production machine and I can't migrate all the VMs to another node for testing.
 

OK, and I misread - thanks for clarifying :)
 
Hi,

Long-time lurker here. I have been seeing the same issue on 5.4.44-1 (reverted to 5.4.41-1 and it has been stable all week).
1/ With 5.4.44-1 it would kernel panic within 2 days
2/ The machine is running an E5-2660v3 on an Asus X99-E WS with 128GB RAM
3/ ZFS rpool on 2 SATA Samsung 840's ([none] mq-deadline)
4/ ZFS pool on 10x 10TB WD REDs ([mq-deadline] none)
5/ one NVMe (ADATA SX8200PNP) running ext4 ([none] mq-deadline)
6/ Samba and nfs-kernel-server running on the base system
7/ one Ubuntu container and one Windows VM
I initially thought it was because I was messing around with the Windows VM, changing the GPU and CPU settings after I had upgraded the kernel.
After destroying the old Windows VM and rebuilding it, I still had panics.

One of the panics ended with RIP: cgroup_bpf_run_filter_skb; I can dig up the photograph if needed.

The odd thing is, I have the same CPU on a different motherboard (an old PowerEdge) which hasn't seen this issue. (That one doesn't run Samba.)
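
For reference, the bracketed scheduler names above are simply what the kernel reports in sysfs; something like this prints them for every block device, with the active scheduler shown in brackets:
Bash:
# the scheduler in [brackets] is the active one, one line per block device
grep . /sys/block/*/queue/scheduler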
 

You can try kernel 5.4.44-2, which has been stable for me since yesterday.

 
FYI: Due to no issues here for over 40 hours, the positive feedback we got in this thread, and the fix having been picked up on the upstream kernel mailing list, we moved this kernel update to pve-no-subscription.


Some more in-depth information on the underlying issue, for those interested: the actual bug was introduced much earlier than this update. To our current understanding it goes back to the 4.5 kernel release, with commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup"), which added a per-socket cgroup information data structure but did not handle reference counting of that shared structure when cloning a socket. Until recently, however, the probability of hitting this was so low that in practice it never triggered.
With a recent cgroup memory leak fix (upstream commit 090e28b229af), which was included in the update from v5.4.41 to v5.4.44, this began to matter and the bug could now actually trigger.
It is not actually related directly to Samba or NFS, but having such a file server active greatly increases the chance of this happening - on hosts with systemd those services run in cgroups and generate quite a bit of network activity when in use.
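
If you want to see why an active file server makes a good trigger, you can check that these daemons really do run inside their own cgroups on a systemd host; for example (smbd is only used as an example here, unit and process names may differ on your setup):
Bash:
# cgroup the Samba daemon lives in (any systemd-managed network service looks similar)
cat /proc/$(pidof -s smbd)/cgroup
# the CGroup: line of the unit status shows the same hierarchy
systemctl status smbd --no-pager | grep -i CGroup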
 
Thank you for explaining the issue. I've rebooted into the new kernel on both my systems and will report back if the panic reappears.
 
