Opt-in Linux 6.8 Kernel for Proxmox VE 8 available on test & no-subscription

Just installed http://download.proxmox.com/debian/...ary-amd64/proxmox-kernel-6.8_6.8.12-1_all.deb and rebooted, but I get:

Code:
[***   ] (1 of 2) Job systemd-udev-settle.service/start running (42s / 3min 1s)
[   56.752464] bnxt_en 0000:c6:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (40971 > 40000) msec active 1                    
[   56.764577] bnxt_en 0000:c6:00.0 bnxt_re0: Failed to modify HW QP
[   56.770681] infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
[   56.777389] infiniband bnxt_re0: Couldn't start port
[   56.782435] bnxt_en 0000:c6:00.0 bnxt_re0: Failed to destroy HW QP
[   56.788644] ------------[ cut here ]------------                                                                                                        
[   56.793270] WARNING: CPU: 10 PID: 1359 at drivers/infiniband/core/cq.c:322 ib_free_cq+0x108/0x150 [ib_core]                                              
[   56.803032] Modules linked in: ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 bnxt_re(+) sha1_ssse3 aesni_intel ib_uverbs crypto_simd cryptd ast ib_core acpi_ipmi wmi_bmof rapl pcspkr i2c_algo_bit ipmi_si ccp ipmi_devintf k10temp ptdma ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap msr efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid hid crc32_pclmul xhci_pci xhci_pci_renesas ahci myri10ge xhci_hcd bnxt_en libahci dca i2c_piix4 wmi
[   56.862547] CPU: 10 PID: 1359 Comm: (udev-worker) Tainted: P           O       6.8.12-1-pve #1                                                          
[   56.871163] Hardware name: Supermicro AS -1114S-WTRT/H12SSW-NT, BIOS 2.8 01/16/2024                                                                      
[**    ] (2 of 2) Job ifupdown2-pre.service/start running (42s / 3min 1s)
[   56.871179] Code: e8 ed 96 02 00 65 ff 0d 8e f8 fd 3e 0f 85 6f ff ff ff 0f 1f 44 00 00 e9 65 ff ff ff 48 8d 7f 50 e8 2d e7 6b e7 e9 34 ff ff ff <0f> 0b 31 c0 31 f6 31 ff e9 e6 f5 8a e8 0f 0b e9 7c ff ff ff 80 3d
[   56.871181] RSP: 0018:ffffa4e10f0236c0 EFLAGS: 00010202
[   56.911132] RAX: 0000000000000002 RBX: 0000000000000001 RCX: 0000000000000000
[   56.919671] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9537c2176000
[   56.928199] RBP: ffffa4e10f023730 R08: 0000000000000000 R09: 0000000000000000
[   56.936731] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9537ed400000
[   56.945263] R13: ffff9537e552b800 R14: 00000000ffffff92 R15: ffff9537dfca1000                                                                            
[   56.953278] FS:  00007ec6d02138c0(0000) GS:ffff95760dd00000(0000) knlGS:0000000000000000                                                                
[   56.961745] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   56.967867] CR2: 00007ec6cf571ec7 CR3: 000000011653a005 CR4: 0000000000f70ef0                                                                            
[   56.975375] PKRU: 55555554
[   56.978474] Call Trace:                                                
[   56.981304]  <TASK>
[   56.983789]  ? show_regs+0x6d/0x80
[   56.987580]  ? __warn+0x89/0x160
[   56.991184]  ? ib_free_cq+0x108/0x150 [ib_core]
[   56.996097]  ? report_bug+0x17e/0x1b0
[   57.000136]  ? handle_bug+0x46/0x90
[   57.003998]  ? exc_invalid_op+0x18/0x80
[   57.008200]  ? asm_exc_invalid_op+0x1b/0x20
[   57.012753]  ? ib_free_cq+0x108/0x150 [ib_core]
[   57.017666]  ? ib_mad_init_device+0x54c/0x880 [ib_core]
[   57.023272]  add_client_context+0x12a/0x1c0 [ib_core]
[   57.028709]  enable_device_and_get+0xe6/0x1e0 [ib_core]
[   57.034317]  ib_register_device+0x506/0x610 [ib_core]
[   57.039752]  ? srso_alias_return_thunk+0x5/0xfbef5
[   57.044913]  ? alloc_port_data+0x59/0x130 [ib_core]
[   57.050159]  ? ib_device_set_netdev+0x160/0x1b0 [ib_core]
[   57.055937]  bnxt_re_probe+0xe56/0x1170 [bnxt_re]
[   57.061018]  ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
[   57.066536]  auxiliary_bus_probe+0x41/0xa0
[   57.071000]  really_probe+0x1ba/0x420
[   57.075031]  __driver_probe_device+0x7d/0x180
[   57.079759]  driver_probe_device+0x24/0xa0
[   57.084225]  __driver_attach+0xf4/0x1f0
[   57.088433]  ? __pfx___driver_attach+0x10/0x10
[   57.093249]  bus_for_each_dev+0x8d/0xf0
[   57.097460]  driver_attach+0x1e/0x30
[   57.101411]  bus_add_driver+0x14e/0x280
[   57.105623]  driver_register+0x5e/0x130
[   57.109834]  __auxiliary_driver_register+0x73/0xf0
[   57.115000]  ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
[   57.120784]  bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
[   57.125956]  ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
[   57.131742]  do_one_initcall+0x5e/0x340
[   57.135962]  do_init_module+0x97/0x280
[   57.140093]  load_module+0x21b0/0x2260
[   57.144223]  init_module_from_file+0x96/0x100
[   57.148956]  ? srso_alias_return_thunk+0x5/0xfbef5
[   57.154119]  ? init_module_from_file+0x96/0x100
[   57.159023]  idempotent_init_module+0x11c/0x2b0
[   57.163912]  __x64_sys_finit_module+0x64/0xd0
[   57.168627]  x64_sys_call+0x169c/0x24b0
[   57.172824]  do_syscall_64+0x81/0x170
[   57.176844]  ? srso_alias_return_thunk+0x5/0xfbef5
[   57.181990]  ? do_syscall_64+0x8d/0x170
[   57.186175]  ? srso_alias_return_thunk+0x5/0xfbef5
[   57.191304]  ? syscall_exit_to_user_mode+0x89/0x260
[   57.196514]  ? srso_alias_return_thunk+0x5/0xfbef5
[   57.201625]  ? do_syscall_64+0x8d/0x170
[   57.205772]  ? srso_alias_return_thunk+0x5/0xfbef5
[   57.210867]  ? exc_page_fault+0x94/0x1b0
[   57.215092]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   57.220451] RIP: 0033:0x7ec6d0921719
[   57.224332] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
[   57.243675] RSP: 002b:00007fffc89fed98 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   57.251551] RAX: ffffffffffffffda RBX: 00005a78de998380 RCX: 00007ec6d0921719
[   57.258985] RDX: 0000000000000000 RSI: 00007ec6d0ab4efd RDI: 000000000000000f
[   57.266426] RBP: 00007ec6d0ab4efd R08: 0000000000000000 R09: 00005a78de8cdd60
[   57.273871] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
[   57.281312] R13: 0000000000000000 R14: 00005a78de991350 R15: 00007fffc89fefd0
[   57.288749]  </TASK>
[   57.291238] ---[ end trace 0000000000000000 ]---
[   57.296188] bnxt_en 0000:c6:00.0 bnxt_re0: Free MW failed: 0xffffff92
[   57.302987] infiniband bnxt_re0: Couldn't open port 1
[  *** ] (1 of 2) Job systemd-udev-settle.se…start running (1min 23s / 3min 1s)
[   97.712458] bnxt_en 0000:c6:00.1: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (40354 > 40000) msec active 1
[   97.724954] bnxt_en 0000:c6:00.1 bnxt_re1: Failed to modify HW QP
[   97.731372] infiniband bnxt_re1: Couldn't change QP1 state to INIT: -110
[   97.738405] infiniband bnxt_re1: Couldn't start port
[   97.743757] bnxt_en 0000:c6:00.1 bnxt_re1: Failed to destroy HW QP
[   97.750456] bnxt_en 0000:c6:00.1 bnxt_re1: Free MW failed: 0xffffff92
[   97.757301] infiniband bnxt_re1: Couldn't open port 1
[  OK  ] Finished systemd-udev-sett…To Complete Device Initialization.

After that, the system fails to continue booting in any sensible way.
 
Hello all!

We have a bunch of old HP ProLiant BL460c G7 servers that are still fully functional. Since they have quite a lot of memory and are attached to fast storage, we're going to install Proxmox on them for development and homologation purposes.

The summarized specs are:

- 2 Intel Xeon X5650 CPUs (Westmere);
- 96GB ECC DDR3 RAM;
- 2 local 146GB SAS disks attached to an HP P410i RAID controller.

We always create a 1+0 hardware RAID volume from the two SAS disks for the local installation and booting - this RAID controller is old and doesn't support HBA mode - and all other data goes over NFS or CIFS shares to a NetApp storage system.

Prior to this, I had been testing an 8.1.3 Proxmox cluster on them for months, and everything was stable and working.

But when I tried to upgrade to PVE 8.2, which uses the new 6.8 kernel, it failed early in the install process. Sometimes it can't get past creating partitions, sometimes it fails while creating the LVM pools or extracting the base system, and sometimes it finishes the install and crashes almost instantly afterwards. Since it's all disk related, my suspicion is that the new kernel is not playing nice with my old RAID controller.

To confirm that, I took these steps:

- Made an install of the 8.1.4 version (using the proxmox-ve_8.1-2 ISO);
- Configured the no-subscription repository and ran the upgrade;
- The server rebooted and, less than two minutes later, froze completely with some disk error messages (they scrolled by too fast on the screen and I couldn't capture them);
- Remade the install using the same ISO (proxmox-ve_8.1-2);
- Ran the upgrade again, but this time created the /etc/apt/preferences.d/proxmox-default-kernel file to pin the kernel and prevent it from upgrading to 6.8 (a sketch of such a pin file is just below this list);
- Rebooted the server, and it has now been stable all day long - even with some VMs copying large files and stressing the CPUs to 100%.
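
For anyone who wants to do the same, here is a minimal sketch of such a pin file. The version number is an assumption (it should be the last proxmox-default-kernel release that still depended on the 6.5 series) - verify it with apt-cache policy proxmox-default-kernel before relying on it:

Code:
Package: proxmox-default-kernel
Pin: version 1.0.1
Pin-Priority: 1000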

Now that I'm almost sure the problem is related to the new kernel and my old RAID controller, I'd like to ask if someone could help me with the following:

- Explain how to capture logs that confirm this incompatibility (kernel 6.8, HP P410i RAID controller);
- If they exist, show me settings that make the new kernel play nice with the controller (HP hardware experts?);
- Tell me whether any important functionality will be lost by upgrading PVE (currently at 8.2.4 on the test server) but not upgrading the kernel.

I'm willing to reinstall and reconfigure it as many times as needed if this helps the Proxmox team figure out how to solve this incompatibility, if that's possible at all.

Thanks for the help and for this great product!
Luis.
 
I don't think it's worth the work.

The hardware is at least 12 years old and consumes so much power that nobody wants to run it productively.

You can pin the 6.5 kernel. The 6.5 kernel is still supported even though 6.8 is now the default. I have also pinned the 6.5 kernel on old servers (HPE DL380p Gen8) because there are problems with the onboard network card. This will remain the case until either the servers are replaced or someone fixes the bug for the old NICs.
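
For reference, besides an apt preferences pin, the booted kernel can be pinned with proxmox-boot-tool; a minimal sketch, assuming your proxmox-kernel-helper is recent enough to have the pin/unpin subcommands and that a 6.5 build such as 6.5.13-6-pve is still installed (check the exact name with the list command first):

Code:
proxmox-boot-tool kernel list                 # show installed/bootable kernels
proxmox-boot-tool kernel pin 6.5.13-6-pve     # always boot this version
proxmox-boot-tool kernel unpin                # later: go back to the newest installed kernel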
 
Thanks for the answer. Fair enough, the hardware really is quite old, which is why I will use it as a 'throw-away' environment where the developers can prototype their systems early before moving on to homologation/production servers. Also, it would be a waste to just turn them off, since new servers don't come so easily for us :)

And the migration to the new version (8.2.4) with the old kernel seems stable too, so I'll just pin the kernel and call it a day.

Thank you!
 
I have a few servers and clusters running Proxmox on the 6.8 kernel; when fstrim.service runs, they throw this error:

Code:
[644365.525044] DMAR: [DMA Write NO_PASID] Request device [d8:00.0] fault addr 0xd8265000 [fault reason 0x05] PTE Write access is not set
[644365.525142] DMAR: DRHD: handling fault status reg 702

This is not happening under the 6.5 kernel with matching servers and firmware. The device is a Dell BOSS card with the latest firmware both under 6.5 and 6.8 kernels.

It does not appear to affect the server - fstrim completes without any errors in its log - but syslog/dmesg gets these errors every time the fstrim service runs.
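
For anyone who wants to check the same correlation on their own hosts, roughly (standard fstrim.service unit, nothing Proxmox-specific assumed):

Code:
journalctl -u fstrim.service -n 20     # fstrim itself reports success
dmesg -T | grep -i DMAR | tail         # but DMAR faults appear around the same time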
 
This is not happening under the 6.5 kernel with matching servers and firmware
Possibly caused by the new default intel_iommu=on in current kernels. This would be in line with your errors, which suggest an IOMMU/PCI problem.

Try adding this to the kernel command line:
Code:
intel_iommu=off
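
In case it's unclear where that goes: on a GRUB-booted install it is added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, while on UEFI installs with ZFS on root (systemd-boot) it goes into /etc/kernel/cmdline. A rough sketch - adjust to your boot setup:

Code:
# GRUB: edit /etc/default/grub so the line reads, for example,
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"
update-grub

# systemd-boot (ZFS-on-root UEFI): append intel_iommu=off to /etc/kernel/cmdline, then
proxmox-boot-tool refresh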
 
Can anybody help identify what's wrong here? I will probably look for a USB serial adapter to better diagnose this. The latest kernel, 6.8.12-1, seems to be pretty unstable in my setup; the kernel almost always crashes when that node is doing its daily backups. 6.8.x is hit and miss for me: sometimes it just works, then after an update it's broken.

I'm back on 6.5 for now (just to test backup stability). However, I will switch back to the 6.8 series once I get that serial console setup sorted out. I have also disabled KSM for now.

In the meantime, I'd appreciate any help if people can find some useful info in the attached log... Thanks.
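
For what it's worth, the usual serial-console setup for catching these crashes looks roughly like this; ttyS0 and 115200 are assumptions for a typical onboard or USB serial port, adjust as needed:

Code:
# add to the kernel command line (GRUB or /etc/kernel/cmdline):
console=tty0 console=ttyS0,115200n8

# on the machine at the other end of the cable, capture the output, e.g.:
screen /dev/ttyUSB0 115200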
 

Attachments

  • cpu_lockup.txt
    65.6 KB
I have a server with a B450 board running a Ryzen 1600X (the computer is 6 years old).
Proxmox version:
Bash:
root@proxmox:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-1
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.8-3-pve-signed: 6.8.8-3
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.2
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

I have intermittent crashes/reboots. I have updated the BIOS to the latest version but still face crashes. Ethernet goes down, and HDMI does not provide any output.

Attaching the dmesg from before the crash and the syslog.
 

Attachments

  • dmesg_before_crash.log
    114.1 KB
  • syslog_crash.log
    142.6 KB
Thank you guys for this thread; we had this problem on an HPE ProLiant DL380 Gen10 server with a Broadcom NIC! We resolved it with the doc:

"Kernel: Broadcom Infiniband driver issue"

 
I have the same issue at my customers' sites with DL380 Gen10 servers. A simple firmware update on these NICs also solves the problem.
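
To check which firmware a port is actually running before and after the update (the interface name here is just an example):

Code:
ethtool -i eno1np0 | grep firmware-version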
 
Strangely, the most up-to-date servers have the problem, not the out-of-date ones...
Gen11 are the newest servers, and you can buy a new server that still ships with older firmware on its NICs. The firmware with the fix is from the end of June 2024.
 
