Opt-in Linux 6.5 Kernel with ZFS 2.2 for Proxmox VE 8 available on test & no-subscription

Hi,
I have an R620 with Broadcom BCM5720 && Mellanox ConnectX-4; it's booting fine with 6.5.3 (I'll test other, newer 6.5.x builds to compare)

(I can't tell if the physical console is working, as I only manage them remotely through iDRAC.)

Code:
# lspci |grep -i connect
41:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
41:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

# lspci |grep BCM5720
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
# uname -a
Linux formationkvm3 6.5.3-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.3-1 (2023-10-23T08:03Z) x86_64 GNU/Linux
Our cluster is:

PVE221: Dell PowerEdge R240 Xeon(R) E-2234
PVE222: Dell PowerEdge R610 Xeon(R) CPU L5520 (just checked, I thought it was R620)
PVE223: Dell PowerEdge R6515 AMD EPYC 7313P

Only PVE221 can't boot with kernel 6.5.
It makes no difference whether the tg3 module is loaded or not, and it's not a display issue.
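For reference, a boot without the driver can be tested from the boot menu with a kernel parameter (a sketch, not necessarily how we tested it):

Code:
# append to the kernel command line for a single boot to keep the
# tg3 driver from being loaded:
module_blacklist=tg3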


Regards,
 
Hello everyone,

I would also like to add my two cents. I have exactly the same problem: with PVE kernel 6.5 I can no longer boot my Fujitsu S740 (mini PC) system. It just freezes.
Not sure if this has already been suggested here - but in the boot loader, try removing the 'quiet' flag from the kernel command line - maybe this helps in finding out where the issue is.
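For example, on a stock Proxmox VE install with GRUB (a sketch - the kernel version and root device will differ on your system):

Code:
# at the GRUB menu, press 'e' on the boot entry and edit the 'linux' line:
#   before: linux /boot/vmlinuz-6.5.3-1-pve root=/dev/mapper/pve-root ro quiet
#   after:  linux /boot/vmlinuz-6.5.3-1-pve root=/dev/mapper/pve-root ro
# then boot the edited entry with CTRL-X or F10; kernel messages stay visible.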
 
Hello,
thanks for the quick response. I did; the system also seems to be stuck on the network adapter :-(.
[screenshot attached: boot.jpeg]
What can I do now? It is not a production system, only a backup system at another location.

But if I update the production system, it may run into the same problem, and that would be very bad for a production system.
 
Hello,

when I start the server with the network connected, it gets stuck in this state. If I start the server without a network, it remains stuck at this point for 5-10 minutes and then restarts. After plugging the NIC back in, I still cannot reach the system. :-(

Thanks for any further help!
 

Ah, that screenshot looks very promising, actually. So, this is not a problem with the kernel failing to start; that should be much easier to debug. While I don't know about issues that are specific to Proxmox, this is a situation that you could encounter and debug with any Linux distribution. So, let's look at some of the normal tools you'd use in this kind of situation.

As a first attempt, I would reboot and then edit the kernel command line so that you remove the quiet part and add systemd.debug-shell=1 instead. If everything in Proxmox is configured how I usually expect it to be on Linux, you should then be able to press CTRL-ALT-F9 when the system appears to be hanging. This brings you to a different text console, where you can start a shell and start poking at things.
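Roughly like this, assuming GRUB and a default Proxmox layout (kernel version and root device are just examples):

Code:
# kernel command line with 'quiet' removed and the early debug shell added:
linux /boot/vmlinuz-6.5.11-6-pve root=/dev/mapper/pve-root ro systemd.debug-shell=1
# systemd then opens a root shell on tty9 very early during boot;
# switch to it with CTRL-ALT-F9 once the boot appears to hang.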

I would use commands such as ps -Helf and dmesg to get a general idea of what is trying to run at this point. Then run systemctl, systemctl status, and systemctl status networking to get a better idea of what systemd thinks is happening right now; yes, all three of these commands show different information. Another similar command to try would be systemctl list-jobs. journalctl -xef could also be a good idea, and so could ip a.
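Collected in one place, with a note on what each command tells you:

Code:
ps -Helf                      # process tree: what is running or blocked right now
dmesg | tail -n 50            # recent kernel messages (driver probes, errors)
systemctl                     # all units and their current state
systemctl status              # overall system state and unit tree
systemctl status networking   # state of the network bring-up unit
systemctl list-jobs           # jobs systemd is still waiting on - often the culprit
journalctl -xef               # follow the journal, with explanatory text
ip a                          # which interfaces exist and whether they are up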

Unfortunately, Proxmox VE doesn't use systemd for managing networking, so the information that you'll get here is going to be a little limited. You might have to dig deeper into how Proxmox's ifupdown2-based implementation of ifup works. And since we don't really have a good theory yet about what is going wrong, I can't make any further suggestions for what to do past this point.
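If it does come down to the network bring-up, a few places to look (paths as used by ifupdown2 on Proxmox VE; treat this as a starting point, not a recipe):

Code:
journalctl -u networking.service   # the systemd unit that runs ifupdown2 at boot
ls /var/log/ifupdown2/             # per-run logs written by ifupdown2
ifreload -a -d                     # re-apply /etc/network/interfaces with debug output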

But hopefully, you'll be able to collect useful data.

If systemd.debug-shell doesn't allow you to make progress, try changing the kernel command line to say systemd.unit=rescue.target instead. That stops at a different point during the boot process. It can be helpful, but it can also be more confusing and require you to know a little bit more about systemd. So, this would be my second choice to try.
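As a command-line sketch (same example layout as above):

Code:
# instead of systemd.debug-shell=1, stop the boot at the rescue target:
linux /boot/vmlinuz-6.5.11-6-pve root=/dev/mapper/pve-root ro systemd.unit=rescue.target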
 
I confirm DELL with BCM5720 + kernel 6.5 won't boot on PowerEdge T140.

This is a very serious issue, because a lot of DELL servers use this NIC.
Please resolve!!
 
Dell T340 same problem, same NIC

About the VM migrations: they hang when moving from/to any host running kernel 6.5.
In this cluster we have a Dell R620 and a Dell R6515 (running 6.5) and the R240 (running 6.2).

I confirm DELL with BCM5720 + kernel 6.5 won't boot on PowerEdge T140.

If I see this right, the only affected Dell servers with BCM5720 are 14th gen? T140, R240, T340 (the 4 as second digit indicates 14th gen)

because - we have reports of the tg3/BCM5720 NICs working in an R620 and R630 by @spirit - and an R630 I have access to also runs fine with kernel 6.5; also I think there were users with working R610 (or similar) and R650 (or similar) servers?
Could the affected users maybe share their firmware versions?

Code:
ethtool -i eno1
driver: tg3
version: 6.5.11-6-pve
firmware-version: FFV22.00.6 bc 5720-v1.39
expansion-rom-version: 
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
 
Affected T140

ethtool -i eno1
driver: tg3
version: 6.2.16-19-pve
firmware-version: 5719-v1.46 NCSI v1.5.33.0
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
 
If I see this right, the only affected Dell servers with BCM5720 are 14th gen? T140, R240, T340 (the 4 as second digit indicates 14th gen)

because - we have reports of the tg3/BCM5720 NICs working in an R620 and R630 by @spirit - and an R630 I have access to also runs fine with kernel 6.5; also I think there were users with working R610 (or similar) and R650 (or similar) servers?
Same network card in our Dell R720. The new kernel is working fine.
ethtool -i eno1
driver: tg3
version: 6.5.11-4-pve
firmware-version: FFV21.60.16 bc 5720-v1.39
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
 
Affected T140

ethtool -i eno1
driver: tg3
version: 6.2.16-19-pve
firmware-version: 5719-v1.46 NCSI v1.5.33.0
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
Not too familiar with the versioning of Broadcom NICs (in Dell servers) - but is the NIC a BCM5720 or a BCM5719?
(`lspci -nnk | grep -i broadcom` should provide the information)
 
root@xxx:~# lspci -nnk | grep -i broadcom
01:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] [1000:005f] (rev 02)
05:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
05:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]

on a Dell T140 which won't boot with 6.5
 
On Dell PowerEdge T340

root@pve-t340:~# ethtool -i eno1
driver: tg3
version: 6.2.16-19-pve
firmware-version: FFV22.61.8 bc 5720-v1.39
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Removing the 'quiet' flag from the kernel command line doesn't give me more information. It stops after "Loading initial ramdisk".

Disabling the BCM5720 NIC in the BIOS doesn't change anything.
 
I have some Dell R620s with Broadcom BCM5720, and they are booting fine.

Code:
# lspci -nnk | grep -i broadcom
01:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
01:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
02:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
02:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]

# uname -a
#1 SMP PREEMPT_DYNAMIC PMX 6.5.11-6 (2023-11-29T08:32Z) x86_64 GNU/Linux
 
Booting the 6.5 kernel doesn't generate any log in /var/log/ifupdown2/ for me.
For my part, on the T340 the server hangs after "Loading initial ramdisk".
Replacing "quiet" with "systemd.debug-shell=1" doesn't give me more info on the console, and CTRL-ALT-F9 gives nothing.
 
Does replacing "quiet" with "break=top" make any difference? That's usually a good way to get into the initramfs, and it should be the very first thing that happens after the kernel initialized itself and started userspace.
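For example (a sketch; you land in the minimal busybox shell of the initramfs):

Code:
# on the kernel command line, replace 'quiet' with:
break=top
# the boot then drops to a '(initramfs)' busybox shell before anything else
# runs; from there you can probe step by step, e.g.:
#   modprobe tg3     # load the NIC driver by hand and watch for the hang
#   dmesg | tail     # check what the kernel logged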