6.2.x Kernel Issues on New Hardware

adamb

We just got one of these bad boys in.

https://www.supermicro.com/en/products/system/mp/2u/sys-241e-tnrttp

It has 4x Intel(R) Xeon(R) Gold 6448H CPUs.

It is running the latest BIOS and firmware.

When using 6.2.16-4-bpo11-pve, the server boots up with the following messages in dmesg.

Code:
[Tue Aug 29 05:37:56 2023] BUG: unable to handle page fault for address: ff4b96a26ee09cff
[Tue Aug 29 05:37:56 2023] #PF: supervisor write access in kernel mode
[Tue Aug 29 05:37:56 2023] #PF: error_code(0x0003) - permissions violation
[Tue Aug 29 05:37:56 2023] PGD 35c38802067 P4D 35c38803067 PUD 112e7a063 PMD 12eeea063 PTE 800000012ee09161
[Tue Aug 29 05:37:56 2023] Oops: 0003 [#1] PREEMPT SMP NOPTI
[Tue Aug 29 05:37:56 2023] CPU: 159 PID: 3737 Comm: z_wr_iss Tainted: P        W  O       6.2.16-4-bpo11-pve #1
[Tue Aug 29 05:37:56 2023] Hardware name: Supermicro Super Server/X13QEH+, BIOS 1.2 03/23/2023
[Tue Aug 29 05:37:56 2023] RIP: 0010:kfpu_begin+0x31/0x70 [zcommon]
[Tue Aug 29 05:37:56 2023] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 78 8f 00 00 65 8b 05 5d 0b 40 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[Tue Aug 29 05:37:56 2023] RSP: 0018:ff6d1d32bc497960 EFLAGS: 00010082
[Tue Aug 29 05:37:56 2023] RAX: 00000000ffffffff RBX: ff4b96a395b74000 RCX: ff4b96a26ee07000
[Tue Aug 29 05:37:56 2023] RDX: 00000000ffffffff RSI: ff4b96a395b74000 RDI: ff6d1d32bc497ac0
[Tue Aug 29 05:37:56 2023] RBP: ff6d1d32bc497960 R08: ff6d1d32bc497aa0 R09: 0000000000001000
[Tue Aug 29 05:37:56 2023] R10: ff4b96a25910c000 R11: 000000000000800b R12: ff4b96a395b75000
[Tue Aug 29 05:37:56 2023] R13: ff6d1d32bc497ac0 R14: 0000000000001000 R15: 0000000000000000
[Tue Aug 29 05:37:56 2023] FS:  0000000000000000(0000) GS:ff4b979dbffc0000(0000) knlGS:0000000000000000
[Tue Aug 29 05:37:56 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Aug 29 05:37:56 2023] CR2: ff4b96a26ee09cff CR3: 0000035c36e10002 CR4: 0000000000773ee0
[Tue Aug 29 05:37:56 2023] PKRU: 55555554
[Tue Aug 29 05:37:56 2023] Call Trace:
[Tue Aug 29 05:37:56 2023]  <TASK>
[Tue Aug 29 05:37:56 2023]  fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
[Tue Aug 29 05:37:56 2023]  abd_fletcher_4_iter+0x6b/0xc0 [zcommon]
[Tue Aug 29 05:37:56 2023]  abd_iterate_func.part.0+0x11a/0x1c0 [zfs]
[Tue Aug 29 05:37:56 2023]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[Tue Aug 29 05:37:56 2023]  abd_iterate_func+0x1a/0x20 [zfs]
[Tue Aug 29 05:37:56 2023]  abd_fletcher_4_native+0x7c/0xc0 [zfs]
[Tue Aug 29 05:37:56 2023]  ? vdev_queue_io_to_issue+0x4bd/0xd30 [zfs]
[Tue Aug 29 05:37:56 2023]  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[Tue Aug 29 05:37:56 2023]  zio_checksum_compute+0x106/0x560 [zfs]
[Tue Aug 29 05:37:56 2023]  ? lz4_compress_zfs+0x178/0x7b0 [zfs]
[Tue Aug 29 05:37:56 2023]  ? kmem_cache_free+0x1e/0x3b0
[Tue Aug 29 05:37:56 2023]  ? spl_kmem_cache_free+0x142/0x1d0 [spl]
[Tue Aug 29 05:37:56 2023]  zio_checksum_generate+0x42/0x80 [zfs]
[Tue Aug 29 05:37:56 2023]  zio_execute+0x92/0x160 [zfs]
[Tue Aug 29 05:37:56 2023]  taskq_thread+0x29c/0x4d0 [spl]
[Tue Aug 29 05:37:56 2023]  ? __pfx_default_wake_function+0x10/0x10
[Tue Aug 29 05:37:56 2023]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[Tue Aug 29 05:37:56 2023]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[Tue Aug 29 05:37:56 2023]  kthread+0xee/0x120
[Tue Aug 29 05:37:56 2023]  ? __pfx_kthread+0x10/0x10
[Tue Aug 29 05:37:56 2023]  ret_from_fork+0x29/0x50
[Tue Aug 29 05:37:56 2023]  </TASK>
[Tue Aug 29 05:37:56 2023] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel dm_round_robin nf_tables bonding tls softdog nfnetlink_log nfnetlink dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel pmt_crashlog pmt_telemetry ipmi_ssif cmdlinepart sha512_ssse3 aesni_intel crypto_simd intel_sdsi ast pmt_class rndis_host spi_nor drm_shmem_helper cdc_ether cryptd rapl qat_4xxx drm_kms_helper i2c_algo_bit intel_qat usbnet syscopyarea intel_cstate pcspkr efi_pstore sysfillrect joydev mei_me input_leds mii isst_if_mmio sysimgblt idxd isst_if_mbox_pci crc8 mtd mei isst_if_common intel_vsec authenc
[Tue Aug 29 05:37:56 2023]  idxd_bus acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid pfr_update pfr_telemetry vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c simplefb hid_generic usbmouse usbkbd usbhid hid ixgbe xfrm_algo xhci_pci dca crc32_pclmul xhci_pci_renesas i2c_i801 mdio ahci spi_intel_pci i40e spi_intel i2c_smbus libahci i2c_ismt xhci_hcd wmi pinctrl_emmitsburg
[Tue Aug 29 05:37:56 2023] CR2: ff4b96a26ee09cff
[Tue Aug 29 05:37:56 2023] ---[ end trace 0000000000000000 ]---
[Tue Aug 29 05:37:57 2023] RIP: 0010:kfpu_begin+0x31/0x70 [zcommon]
[Tue Aug 29 05:37:57 2023] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 78 8f 00 00 65 8b 05 5d 0b 40 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[Tue Aug 29 05:37:57 2023] RSP: 0018:ff6d1d32bc497960 EFLAGS: 00010082
[Tue Aug 29 05:37:57 2023] RAX: 00000000ffffffff RBX: ff4b96a395b74000 RCX: ff4b96a26ee07000
[Tue Aug 29 05:37:57 2023] RDX: 00000000ffffffff RSI: ff4b96a395b74000 RDI: ff6d1d32bc497ac0
[Tue Aug 29 05:37:57 2023] RBP: ff6d1d32bc497960 R08: ff6d1d32bc497aa0 R09: 0000000000001000
[Tue Aug 29 05:37:57 2023] R10: ff4b96a25910c000 R11: 000000000000800b R12: ff4b96a395b75000
[Tue Aug 29 05:37:57 2023] R13: ff6d1d32bc497ac0 R14: 0000000000001000 R15: 0000000000000000
[Tue Aug 29 05:37:57 2023] FS:  0000000000000000(0000) GS:ff4b979dbffc0000(0000) knlGS:0000000000000000
[Tue Aug 29 05:37:57 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Aug 29 05:37:57 2023] CR2: ff4b96a26ee09cff CR3: 0000035c36e10002 CR4: 0000000000773ee0
[Tue Aug 29 05:37:57 2023] PKRU: 55555554
[Tue Aug 29 05:37:57 2023] note: z_wr_iss[3737] exited with irqs disabled
[Tue Aug 29 05:37:57 2023] note: z_wr_iss[3737] exited with preempt_count 1
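
For anyone reading the trace, here is a minimal sketch of how the `error_code(0x0003)` value breaks down. It assumes the standard x86 page-fault error-code bit layout; the function name `decode_pf_error_code` and the table are illustrative only, not taken from the kernel source. It only restates what the two `#PF:` lines already say (a kernel-mode write that hit a permissions violation) and does not diagnose the cause.

```python
# Sketch: decode the low bits of an x86 page-fault error code.
# Each entry maps a bit number to (meaning when clear, meaning when set).
# None means the clear state carries no message of its own.
PF_BITS = {
    0: ("not-present page", "permissions violation"),
    1: ("read access", "write access"),
    2: ("kernel mode", "user mode"),
    3: (None, "reserved bit set in page table"),
    4: (None, "instruction fetch"),
    5: (None, "protection-key violation"),
}


def decode_pf_error_code(code):
    """Return a list of human-readable descriptions for the given error code."""
    descriptions = []
    for bit, (when_clear, when_set) in PF_BITS.items():
        text = when_set if code & (1 << bit) else when_clear
        if text is not None:
            descriptions.append(text)
    return descriptions


if __name__ == "__main__":
    # 0x0003 is the value reported in the dmesg output above.
    print(decode_pf_error_code(0x0003))
    # ['permissions violation', 'write access', 'kernel mode']
```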

The host runs oddly and will drop out of the cluster from time to time, VMs will lock up, etc.

If we go back to 5.15.108-1-pve, all is well: those messages aren't in dmesg and the front end is rock solid.

root@ccsprogmiscrit1:~# pveversion
pve-manager/7.4-16/0f39f621 (running kernel: 5.15.108-1-pve)