Proxmox Kernel 6.8.12-2 Freezes (again)

Code:
No CX3:
Jun 12 07:04:51 pve kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) ()
Jun 12 07:04:51 pve kernel: DMI: To Be Filled By O.E.M. X570D4U-2L2T/X570D4U-2L2T, BIOS P1.70 09/15/2022

CX3 installed together with new B650:
Jul 06 21:54:50 pve kernel: Linux version 6.8.8-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-2 (2024-06-24T09:00Z) ()
Jul 06 21:54:50 pve kernel: DMI: To Be Filled By O.E.M. B650D4U-2L2T/BCM/B650D4U-2L2T/BCM, BIOS 4.09 10/02/2023
Jul 06 21:54:50 pve kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

First reboot: (sporadic)
Aug 26 07:43:07 pve kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Aug 26 07:43:07 pve kernel: DMI: To Be Filled By O.E.M. B650D4U-2L2T/BCM/B650D4U-2L2T/BCM, BIOS 4.09 10/02/2023
Aug 26 07:43:07 pve kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
Second reboot: (sporadic)
Aug 26 16:39:58 pve kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Aug 26 16:39:58 pve kernel: DMI: To Be Filled By O.E.M. B650D4U-2L2T/BCM/B650D4U-2L2T/BCM, BIOS 4.09 10/02/2023
Aug 26 16:39:58 pve kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
Third reboot: (sporadic)
Aug 26 16:43:09 pve kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Aug 26 16:43:09 pve kernel: DMI: To Be Filled By O.E.M. B650D4U-2L2T/BCM/B650D4U-2L2T/BCM, BIOS 4.09 10/02/2023
Aug 26 16:43:09 pve kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
Fourth reboot: (sporadic)
Aug 26 16:45:53 pve kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Aug 26 16:45:53 pve kernel: DMI: To Be Filled By O.E.M. B650D4U-2L2T/BCM/B650D4U-2L2T/BCM, BIOS 4.09 10/02/2023
Aug 26 16:45:53 pve kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
Sixth reboot: (sporadic)
Aug 29 15:04:23 pve kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Aug 29 15:04:23 pve kernel: DMI: To Be Filled By O.E.M. B650D4U-2L2T/BCM/B650D4U-2L2T/BCM, BIOS 4.09 10/02/2023
Aug 29 15:04:23 pve kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
Switched back to X570:
Aug 29 15:43:08 pve kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Aug 29 15:43:08 pve kernel: DMI: To Be Filled By O.E.M. X570D4U-2L2T/X570D4U-2L2T, BIOS P1.70 09/15/2022
Aug 29 15:43:08 pve kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
Reboot because of SAS reconfiguration (planned):
Aug 31 17:02:27 pve kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Aug 31 17:02:27 pve kernel: DMI: To Be Filled By O.E.M. X570D4U-2L2T/X570D4U-2L2T, BIOS P1.70 09/15/2022
Aug 31 17:02:27 pve kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
Running fine since then!
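A timeline like the one above can be pulled out of the journal instead of being assembled by hand. The following is a minimal sketch, assuming the default journal line layout seen in these logs (`Mon DD HH:MM:SS host kernel: ...`) and a persistent journal; the function name `summarise_boots` is made up for illustration.

```shell
#!/bin/sh
# Print "<timestamp> <kernel release>" for every "Linux version" banner
# read from stdin. Other kernel lines (DMI, mlx4_en, ...) are ignored.
# Assumed field layout: $1-$3 timestamp, $4 host, $5 "kernel:", $8 release.
summarise_boots() {
    awk '$5 == "kernel:" && $6 == "Linux" && $7 == "version" {
        print $1, $2, $3, $8
    }'
}

# Typical use on a host that keeps a persistent journal (not executed here):
#   journalctl _TRANSPORT=kernel --no-pager | summarise_boots
```

Comparing the result with `journalctl --list-boots` shows which boots were unplanned.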
 
It seems I found the cause, at least for the case I mentioned in my last messages. I do not know why, but the BMC apparently picked up old settings: a bond was configured for the BMC and eth1 was activated. This causes issues if you use an external NIC on the B650. Now I can migrate the VMs without any issues again.

All in all, that solves this particular issue, but not the issue of other hosts rebooting.
 
in my case it was a memory problem, remember to test your equipment properly (memtest86)... (solved)

Sometimes this helps; I also did this with servers on my side. But a wide range of servers here have these issues, and memtest was always green.
 
Hi,

Is anyone still experiencing this problem (random reboots)?

It has been a bit quiet for a while. I am still using kernel 6.11 and the following boot parameters:

Code:
kernel.softlockup_panic=0 pcie_port_pm=off pcie_aspm.policy=performance libata.force=noncq nox2apic

I also used the ASRock Rack B650D4U in most servers. Some servers with that board died, so I needed to replace the board. Since these boards were replaced, I have had no further reboots for a week.
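On Proxmox VE, parameters like these go into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` (followed by `update-grub`) or, on systemd-boot installs, into `/etc/kernel/cmdline` (followed by `proxmox-boot-tool refresh`). After the reboot it is worth checking that they actually reached the kernel. A small sketch for that check; the helper name `missing_params` is hypothetical.

```shell
#!/bin/sh
# Print every wanted parameter that is absent from a kernel command line.
# $1 = command line string, remaining arguments = parameters to look for.
missing_params() {
    cmdline=$1
    shift
    for want in "$@"; do
        case " $cmdline " in
            *" $want "*) ;;                 # present as a whole word
            *) printf '%s\n' "$want" ;;     # not found, report it
        esac
    done
}

# On a running host (not executed here):
#   missing_params "$(cat /proc/cmdline)" \
#       pcie_port_pm=off pcie_aspm.policy=performance libata.force=noncq nox2apic
```

No output means every listed parameter is active on the running kernel.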
 
Thanks Deco. I have a ChangWang N305 board, and after I updated the BIOS it seems to be more stable. I got the BIOS from the vendor HUNSN via an Amazon chat. I also applied the settings from the following wiki page in GRUB: https://www.thomas-krenn.com/de/wiki/Known_Issues_Proxmox_VE_8.2
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off consoleblank=15 pcie_port_pm=off libata.force=noncq intel_iommu=on iommu=pt"
and kernel Linux pve 6.5.13-6-pve

I asked ChatGPT about your GRUB parameters. Does the softlockup setting make a difference? Were you able to read any output via the HDMI port?

How did you enable kernel 6.11?

Code:
root@pve:~# apt install pve-kernel-
pve-kernel-6.1           pve-kernel-6.2           pve-kernel-6.2.16-2-pve  pve-kernel-6.2.16-4-pve  pve-kernel-helper
pve-kernel-6.1.10-1-pve  pve-kernel-6.2.16-1-pve  pve-kernel-6.2.16-3-pve  pve-kernel-6.2.16-5-pve  pve-kernel-libc-dev




 
I actually think, but am not sure, that I had a memory problem as well. As I needed to "do something" while waiting 4 weeks for a new barebone, I stripped the machine, removed the CPU cooler, cleaned off all residue with isopropyl alcohol, and applied new quality cooling paste. I took out the memory, cleaned the connectors with isopropyl alcohol, reinserted it, and cleared the CMOS, forcing the BIOS to re-identify it...

Strangely enough, this actually seems to have solved the issue for me. I should of course run a memtest, and I'm planning to. I believe clearing the CMOS was what actually fixed it. Now I have a spare N305 barebone on the shelf at home and feel a bit more secure... (maybe I will try to set up HA).
 
I have two Zen 5 systems... One of them hasn't frozen at all in over 80 days. My other system is freezing on a daily basis on the 6.8.12-2 kernel.
How can it be that the Proxmox team remains silent for so long about such a critical problem that could affect an entire CPU platform? This is really absurd.
 
Didn't know this was an issue until this morning. System froze up, no response. Commands don't work from the machine. Pressing the power button cleared the screen and left me with this until I forced it to shut down: "[FAILED] Failed to start systemd-journald.service - Journal Service." Logs unhelpful. No clue how to look it up, was about to make a post here. Found that this thread was near the top. Checked the kernel version, yep, 6.8.12-2. Someone said it's a shared issue on EPYC. It's running a 7xx2 CPU, so yeah, that fits. Glad to see I was forced into awareness of this issue after 6.8.12-3 was released. Hopefully it fixes it, but I cannot understand the patch notes. Does anyone know whether this issue is fixed with it?
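For anyone who wants to script the same check across several hosts, here is a minimal sketch. It assumes the release string for the kernel discussed in this thread is `6.8.12-2-pve`, following the naming pattern of the other releases in the logs earlier in the thread; the function name is made up.

```shell
#!/bin/sh
# Succeeds when the given kernel release is the one reported as freezing
# in this thread. The "-pve" suffix is an assumption based on the naming
# of the other Proxmox kernels quoted here.
is_reported_release() {
    case "$1" in
        6.8.12-2-pve) return 0 ;;
        *) return 1 ;;
    esac
}

running=$(uname -r)
if is_reported_release "$running"; then
    echo "running $running, the release reported as freezing"
else
    echo "running $running"
fi
```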
 
Hey folks! An update from my side!

Since I switched to the Ubuntu kernel 6.11 and replaced the boards, from ASRock Rack B650D4U to Supermicro H13SAE-MF for the AM5 platform, I have had no further issues.

It seems the whole thing is both a software and a hardware issue. For the AM5 platform I recommend using the Supermicro H13SAE-MF for servers, as these boards run really stably. ASRock Rack boards can initiate a reboot and can die after a few months. So they will reboot and reboot, and one day you won't be able to start the server anymore. Nothing will help, no CMOS clear, no BIOS or IPMI update. You need to replace the board.
 
Just download the Kernel packages from an Ubuntu mirror (picked mirror.plusserver.com randomly):
It's OK for testing. For production hosts you should think about running your own apt repos where you can store these packages. Or wait and hope that Proxmox will provide 6.11 kernels soon.

There is a package dependency on wireless-regdb. It also exists in the Debian repos. Just install it beforehand with apt.
Then install the deb packages directly.

You should load br_netfilter module explicitly:
Just add a line "br_netfilter" to "/etc/modules-load.d/br_netfilter.conf"

Because pve-firewall tries to read /proc/sys/net/bridge/bridge-nf-call-iptables, which doesn't exist if the module isn't loaded.
Seems Proxmox patched this in their own version of the Ubuntu kernel. But the workaround is simple enough. :)
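The br_netfilter workaround can be scripted roughly as follows. This is only a sketch: the function name and its optional directory argument are invented so it can be tried against a scratch directory first; on the host it needs root.

```shell
#!/bin/sh
# Write a modules-load.d entry so br_netfilter is loaded at every boot.
# $1 = target directory (defaults to /etc/modules-load.d).
persist_br_netfilter() {
    dir=${1:-/etc/modules-load.d}
    printf '%s\n' br_netfilter > "$dir/br_netfilter.conf"
}

# On the host, as root (not executed here):
#   persist_br_netfilter
#   modprobe br_netfilter      # load it now, without waiting for a reboot
#   test -e /proc/sys/net/bridge/bridge-nf-call-iptables && echo present
```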

update-grub or the proxmox-boot-tool should set the 6.11 kernel as default because it's the highest number. So just reboot and hope :)

Of course you could also add the whole Ubuntu repo to your system. But that also needs some apt pinning configuration. Otherwise you will probably fuck up your system if you start to mix packages from Debian, Ubuntu and Proxmox. :p
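As a rough illustration of the kind of pinning meant here, the sketch below writes an apt preferences file that blocks the Ubuntu archive by default and lets only kernel packages through at a low priority. It is untested on a real mixed system. The package-name globs and the function name are assumptions; the real file would live under /etc/apt/preferences.d/.

```shell
#!/bin/sh
# Write an apt preferences file to the path given as $1.
write_ubuntu_pin() {
    cat > "$1" <<'EOF'
Explanation: never pull ordinary packages from the Ubuntu archive
Package: *
Pin: release o=Ubuntu
Pin-Priority: -1

Explanation: assumed kernel package globs, adjust to the real names
Package: linux-image-* linux-modules-*
Pin: release o=Ubuntu
Pin-Priority: 100
EOF
}

# On the host, as root (not executed here):
#   write_ubuntu_pin /etc/apt/preferences.d/ubuntu-kernel
#   apt-cache policy            # check the resulting priorities
```

The negative priority keeps apt from ever selecting an Ubuntu version of a regular package, which is the mixing problem described above.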
I kind of agree with the Approach. I hit the same Kernel Panic at boot Time with an old ASUS P9D WS + Intel Xeon E3-1245 v3 CPU.

However, 2 Things to note:
a. I would NOT trust Ubuntu's ZFS Packaging from 100km Away ... They screwed up pretty badly once and that caused major Data Loss. Plus the Boot Time ZFS Generator had a BUG that I reported and no Action taken over several Months. So yeah, just install ZFS from Source OR

b. Install the Debian Kernel 6.1.x (or Debian Backports Kernel 6.10.x or 6.11.x)



Note however that, even with ZFS 2.2.6, Kernel 6.11 is NOT officially supported. You can see some Stack Traces in dmesg. I don't think they are too bad, but just be aware of it.

EDIT 1:

And since zfs-dkms is NOT provided by Proxmox Repositories, I decided to install everything from the Bookworm Repositories (zfsutils-linux, zfs-zed, ...).

I have a bad feeling about upgrades though ...
 
Great!


How many days does it usually take to be added to Enterprise Repo?
Hi,

Regarding this, I saw today the new proxmox-kernel 6.8.12-4. Has anyone already tested it?

My current version is 6.8.8-4, and I haven't updated yet because it's stable. But as I saw some people dealing with issues on the latest versions, I'm afraid of messing up my environment.
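One way to try a newer kernel while keeping the known-good one as the boot default is to pin it with proxmox-boot-tool. A minimal sketch, assuming the installed release is named `6.8.8-4-pve` as in the boot logs earlier in the thread; the helper `pin_command` is hypothetical and only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Build the proxmox-boot-tool command that pins a kernel as boot default.
pin_command() {
    printf 'proxmox-boot-tool kernel pin %s\n' "$1"
}

pin_command 6.8.8-4-pve

# Then, on the host as root (not executed here):
#   proxmox-boot-tool kernel list            # confirm the release is listed
#   proxmox-boot-tool kernel pin 6.8.8-4-pve
#   proxmox-boot-tool kernel unpin           # revert once the new one is trusted
```

With the pin in place, a newer kernel can be installed and selected manually in the boot menu for testing, while an unattended reboot still comes up on the pinned one.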
 
