6.2.x Kernel Issues on New Hardware

adamb

We just got one of these bad boys in.

https://www.supermicro.com/en/products/system/mp/2u/sys-241e-tnrttp

It has 4x Intel(R) Xeon(R) Gold 6448H CPUs.

It is running the latest BIOS and firmware.

When using kernel 6.2.16-4-bpo11-pve, the server boots up with the following messages in dmesg.

Code:
[Tue Aug 29 05:37:56 2023] BUG: unable to handle page fault for address: ff4b96a26ee09cff
[Tue Aug 29 05:37:56 2023] #PF: supervisor write access in kernel mode
[Tue Aug 29 05:37:56 2023] #PF: error_code(0x0003) - permissions violation
[Tue Aug 29 05:37:56 2023] PGD 35c38802067 P4D 35c38803067 PUD 112e7a063 PMD 12eeea063 PTE 800000012ee09161
[Tue Aug 29 05:37:56 2023] Oops: 0003 [#1] PREEMPT SMP NOPTI
[Tue Aug 29 05:37:56 2023] CPU: 159 PID: 3737 Comm: z_wr_iss Tainted: P        W  O       6.2.16-4-bpo11-pve #1
[Tue Aug 29 05:37:56 2023] Hardware name: Supermicro Super Server/X13QEH+, BIOS 1.2 03/23/2023
[Tue Aug 29 05:37:56 2023] RIP: 0010:kfpu_begin+0x31/0x70 [zcommon]
[Tue Aug 29 05:37:56 2023] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 78 8f 00 00 65 8b 05 5d 0b 40 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[Tue Aug 29 05:37:56 2023] RSP: 0018:ff6d1d32bc497960 EFLAGS: 00010082
[Tue Aug 29 05:37:56 2023] RAX: 00000000ffffffff RBX: ff4b96a395b74000 RCX: ff4b96a26ee07000
[Tue Aug 29 05:37:56 2023] RDX: 00000000ffffffff RSI: ff4b96a395b74000 RDI: ff6d1d32bc497ac0
[Tue Aug 29 05:37:56 2023] RBP: ff6d1d32bc497960 R08: ff6d1d32bc497aa0 R09: 0000000000001000
[Tue Aug 29 05:37:56 2023] R10: ff4b96a25910c000 R11: 000000000000800b R12: ff4b96a395b75000
[Tue Aug 29 05:37:56 2023] R13: ff6d1d32bc497ac0 R14: 0000000000001000 R15: 0000000000000000
[Tue Aug 29 05:37:56 2023] FS:  0000000000000000(0000) GS:ff4b979dbffc0000(0000) knlGS:0000000000000000
[Tue Aug 29 05:37:56 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Aug 29 05:37:56 2023] CR2: ff4b96a26ee09cff CR3: 0000035c36e10002 CR4: 0000000000773ee0
[Tue Aug 29 05:37:56 2023] PKRU: 55555554
[Tue Aug 29 05:37:56 2023] Call Trace:
[Tue Aug 29 05:37:56 2023]  <TASK>
[Tue Aug 29 05:37:56 2023]  fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
[Tue Aug 29 05:37:56 2023]  abd_fletcher_4_iter+0x6b/0xc0 [zcommon]
[Tue Aug 29 05:37:56 2023]  abd_iterate_func.part.0+0x11a/0x1c0 [zfs]
[Tue Aug 29 05:37:56 2023]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[Tue Aug 29 05:37:56 2023]  abd_iterate_func+0x1a/0x20 [zfs]
[Tue Aug 29 05:37:56 2023]  abd_fletcher_4_native+0x7c/0xc0 [zfs]
[Tue Aug 29 05:37:56 2023]  ? vdev_queue_io_to_issue+0x4bd/0xd30 [zfs]
[Tue Aug 29 05:37:56 2023]  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[Tue Aug 29 05:37:56 2023]  zio_checksum_compute+0x106/0x560 [zfs]
[Tue Aug 29 05:37:56 2023]  ? lz4_compress_zfs+0x178/0x7b0 [zfs]
[Tue Aug 29 05:37:56 2023]  ? kmem_cache_free+0x1e/0x3b0
[Tue Aug 29 05:37:56 2023]  ? spl_kmem_cache_free+0x142/0x1d0 [spl]
[Tue Aug 29 05:37:56 2023]  zio_checksum_generate+0x42/0x80 [zfs]
[Tue Aug 29 05:37:56 2023]  zio_execute+0x92/0x160 [zfs]
[Tue Aug 29 05:37:56 2023]  taskq_thread+0x29c/0x4d0 [spl]
[Tue Aug 29 05:37:56 2023]  ? __pfx_default_wake_function+0x10/0x10
[Tue Aug 29 05:37:56 2023]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[Tue Aug 29 05:37:56 2023]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[Tue Aug 29 05:37:56 2023]  kthread+0xee/0x120
[Tue Aug 29 05:37:56 2023]  ? __pfx_kthread+0x10/0x10
[Tue Aug 29 05:37:56 2023]  ret_from_fork+0x29/0x50
[Tue Aug 29 05:37:56 2023]  </TASK>
[Tue Aug 29 05:37:56 2023] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel dm_round_robin nf_tables bonding tls softdog nfnetlink_log nfnetlink dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel pmt_crashlog pmt_telemetry ipmi_ssif cmdlinepart sha512_ssse3 aesni_intel crypto_simd intel_sdsi ast pmt_class rndis_host spi_nor drm_shmem_helper cdc_ether cryptd rapl qat_4xxx drm_kms_helper i2c_algo_bit intel_qat usbnet syscopyarea intel_cstate pcspkr efi_pstore sysfillrect joydev mei_me input_leds mii isst_if_mmio sysimgblt idxd isst_if_mbox_pci crc8 mtd mei isst_if_common intel_vsec authenc
[Tue Aug 29 05:37:56 2023]  idxd_bus acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid pfr_update pfr_telemetry vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c simplefb hid_generic usbmouse usbkbd usbhid hid ixgbe xfrm_algo xhci_pci dca crc32_pclmul xhci_pci_renesas i2c_i801 mdio ahci spi_intel_pci i40e spi_intel i2c_smbus libahci i2c_ismt xhci_hcd wmi pinctrl_emmitsburg
[Tue Aug 29 05:37:56 2023] CR2: ff4b96a26ee09cff
[Tue Aug 29 05:37:56 2023] ---[ end trace 0000000000000000 ]---
[Tue Aug 29 05:37:57 2023] RIP: 0010:kfpu_begin+0x31/0x70 [zcommon]
[Tue Aug 29 05:37:57 2023] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 78 8f 00 00 65 8b 05 5d 0b 40 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[Tue Aug 29 05:37:57 2023] RSP: 0018:ff6d1d32bc497960 EFLAGS: 00010082
[Tue Aug 29 05:37:57 2023] RAX: 00000000ffffffff RBX: ff4b96a395b74000 RCX: ff4b96a26ee07000
[Tue Aug 29 05:37:57 2023] RDX: 00000000ffffffff RSI: ff4b96a395b74000 RDI: ff6d1d32bc497ac0
[Tue Aug 29 05:37:57 2023] RBP: ff6d1d32bc497960 R08: ff6d1d32bc497aa0 R09: 0000000000001000
[Tue Aug 29 05:37:57 2023] R10: ff4b96a25910c000 R11: 000000000000800b R12: ff4b96a395b75000
[Tue Aug 29 05:37:57 2023] R13: ff6d1d32bc497ac0 R14: 0000000000001000 R15: 0000000000000000
[Tue Aug 29 05:37:57 2023] FS:  0000000000000000(0000) GS:ff4b979dbffc0000(0000) knlGS:0000000000000000
[Tue Aug 29 05:37:57 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Aug 29 05:37:57 2023] CR2: ff4b96a26ee09cff CR3: 0000035c36e10002 CR4: 0000000000773ee0
[Tue Aug 29 05:37:57 2023] PKRU: 55555554
[Tue Aug 29 05:37:57 2023] note: z_wr_iss[3737] exited with irqs disabled
[Tue Aug 29 05:37:57 2023] note: z_wr_iss[3737] exited with preempt_count 1
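
To make the fault a bit easier to read, here is a minimal Python sketch that decodes the `error_code` value and works out how far the faulting address (CR2) lies from the pointer held in RCX. The bit meanings are the standard x86 page-fault error code bits. The function names are made up for illustration, and treating RCX as the base of the buffer being written is an assumption on my part, not something the log states.

```python
# Standard x86 page-fault error code bits (low five bits only).
# Each entry maps a bit number to (meaning if set, meaning if clear).
PF_BITS = {
    0: ("protection violation", "page not present"),
    1: ("write", "read"),
    2: ("user mode", "supervisor mode"),
    3: ("reserved bit set", None),
    4: ("instruction fetch", None),
}


def decode_pf_error_code(code):
    """Return human-readable flags for a page-fault error code."""
    flags = []
    for bit, (if_set, if_clear) in PF_BITS.items():
        if code & (1 << bit):
            flags.append(if_set)
        elif if_clear is not None:
            flags.append(if_clear)
    return flags


def fault_offset(cr2, base):
    """Distance in bytes from an assumed buffer base to the faulting address."""
    return cr2 - base


if __name__ == "__main__":
    # Values copied from the trace above.
    print(decode_pf_error_code(0x0003))
    print(hex(fault_offset(0xFF4B96A26EE09CFF, 0xFF4B96A26EE07000)))
```

For the trace above this gives "protection violation", "write", "supervisor mode", which matches the `#PF` lines, and an offset of 0x2cff bytes between RCX and CR2.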

The host behaves erratically: it drops out of the cluster from time to time, VMs lock up, etc.

If we go back to 5.15.108-1-pve, all is well. Those messages aren't in dmesg and the front end is rock solid.
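
For anyone who wants to compare kernels on their own hardware, below is a rough sketch of how the dmesg text can be checked for the same kind of fault after a boot. The two patterns are taken from the trace in this post; the function name `find_kernel_faults` is hypothetical, and the list of markers is not meant to be complete.

```python
import re

# Patterns taken from the trace in this post; not an exhaustive list.
FAULT_MARKERS = re.compile(
    r"BUG: unable to handle page fault|Oops: [0-9a-f]{4} \[#\d+\]"
)


def find_kernel_faults(dmesg_text):
    """Return the lines of dmesg output that look like a page fault or oops."""
    return [
        line for line in dmesg_text.splitlines()
        if FAULT_MARKERS.search(line)
    ]
```

Pass it the captured output of `dmesg -T`. An empty list means none of these markers showed up in that boot.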

root@ccsprogmiscrit1:~# pveversion
pve-manager/7.4-16/0f39f621 (running kernel: 5.15.108-1-pve)
 
