Random freezes, maybe ZFS related

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes

--> #10: Set pcie_aspm=off and pcie_port_pm=off
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 -> pending
#12: Set intel_pstate=disable -> pending
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200
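
Most of the steps above are kernel boot parameters, so before counting a crash against one of them it is worth confirming that the parameter actually reached the running kernel. A minimal Python sketch of that check follows; the function names are made up for illustration, and only `/proc/cmdline` is a real kernel interface:

```python
from pathlib import Path


def missing_params(cmdline: str, wanted: list[str]) -> list[str]:
    """Return the entries of `wanted` that are not exact tokens of the command line."""
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text()


def report(wanted: list[str]) -> None:
    """Print every wanted parameter that the running kernel did not receive."""
    for param in missing_params(read_cmdline(), wanted):
        print(f"not active: {param}")
```

On the host, `report(["pcie_aspm=off", "pcie_port_pm=off"])` would print nothing if step #10 is really in effect. The comparison is on whole tokens, so `processor.max_cstate=1` is not confused with `processor.max_cstate=10`.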
 
Another crash

Code:
May 04 18:03:03 srv02 kernel: VERIFY3(remove_reference(hdr, hdr) > 0) failed (0 > 0)
May 04 18:03:03 srv02 kernel: PANIC at arc.c:6610:arc_write_done()
May 04 18:03:03 srv02 kernel: Showing stack for process 785
May 04 18:03:03 srv02 kernel: CPU: 28 PID: 785 Comm: z_wr_int_2 Tainted: P           O       6.5.13-5-pve #1
May 04 18:03:03 srv02 kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/W680/MB DC, BIOS 2008 03/20/2024
May 04 18:03:03 srv02 kernel: Call Trace:
May 04 18:03:03 srv02 kernel:  <TASK>
May 04 18:03:03 srv02 kernel:  dump_stack_lvl+0x48/0x70
May 04 18:03:03 srv02 kernel:  dump_stack+0x10/0x20
May 04 18:03:03 srv02 kernel:  spl_dumpstack+0x29/0x40 [spl]
May 04 18:03:03 srv02 kernel:  spl_panic+0xfc/0x120 [spl]
May 04 18:03:03 srv02 kernel:  arc_write_done+0x44f/0x550 [zfs]
May 04 18:03:03 srv02 kernel:  ? mutex_lock+0x12/0x50
May 04 18:03:03 srv02 kernel:  zio_done+0x289/0x10b0 [zfs]
May 04 18:03:03 srv02 kernel:  ? kfree+0x78/0x120
May 04 18:03:03 srv02 kernel:  zio_execute+0x88/0x130 [zfs]
May 04 18:03:03 srv02 kernel:  taskq_thread+0x27f/0x490 [spl]
May 04 18:03:03 srv02 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 04 18:03:03 srv02 kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
May 04 18:03:03 srv02 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 04 18:03:03 srv02 kernel:  kthread+0xef/0x120
May 04 18:03:03 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:03 srv02 kernel:  ret_from_fork+0x44/0x70
May 04 18:03:03 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:03 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 04 18:03:03 srv02 kernel:  </TASK>
May 04 18:03:17 srv02 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000060
May 04 18:03:17 srv02 kernel: #PF: supervisor read access in kernel mode
May 04 18:03:17 srv02 kernel: #PF: error_code(0x0000) - not-present page
May 04 18:03:17 srv02 kernel: PGD 0 P4D 0
May 04 18:03:17 srv02 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
May 04 18:03:17 srv02 kernel: CPU: 4 PID: 3511 Comm: zvol Tainted: P           O       6.5.13-5-pve #1
May 04 18:03:17 srv02 kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/W680/MB DC, BIOS 2008 03/20/2024
May 04 18:03:17 srv02 kernel: RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel: Code: 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 8b 1f <48> 81 7b 60 40 92 c6 c0 0f 84 f5 00 00 00 48 8b 33 48 8b 53 08 48
May 04 18:03:17 srv02 kernel: RSP: 0018:ffff9f7e8ddabb90 EFLAGS: 00010286
May 04 18:03:17 srv02 kernel: RAX: ffff9455646eb720 RBX: 0000000000000000 RCX: 0000000000000000
May 04 18:03:17 srv02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9454dfd7c960
May 04 18:03:17 srv02 kernel: RBP: ffff9f7e8ddabbb8 R08: 0000000000000000 R09: 0000000000000000
May 04 18:03:17 srv02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 04 18:03:17 srv02 kernel: R13: ffff9455646eb720 R14: 0000000000000000 R15: ffff9f7e8ddabc48
May 04 18:03:17 srv02 kernel: FS:  0000000000000000(0000) GS:ffff9460bf100000(0000) knlGS:0000000000000000
May 04 18:03:17 srv02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060 CR3: 00000004e2856000 CR4: 0000000000752ee0
May 04 18:03:17 srv02 kernel: PKRU: 55555554
May 04 18:03:17 srv02 kernel: Call Trace:
May 04 18:03:17 srv02 kernel:  <TASK>
May 04 18:03:17 srv02 kernel:  ? show_regs+0x6d/0x80
May 04 18:03:17 srv02 kernel:  ? __die+0x24/0x80
May 04 18:03:17 srv02 kernel:  ? page_fault_oops+0x176/0x500
May 04 18:03:17 srv02 kernel:  ? do_user_addr_fault+0x31d/0x6a0
May 04 18:03:17 srv02 kernel:  ? exc_page_fault+0x83/0x1b0
May 04 18:03:17 srv02 kernel:  ? asm_exc_page_fault+0x27/0x30
May 04 18:03:17 srv02 kernel:  ? arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel:  dbuf_hold_impl+0x9a/0x730 [zfs]
May 04 18:03:17 srv02 kernel:  ? zio_create+0x3e8/0x660 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_check_ioerr+0x61/0x110 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_count_write+0xe2/0x1d0 [zfs]
May 04 18:03:17 srv02 kernel:  ? dmu_tx_hold_dnode_impl+0x57/0x130 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_hold_write_by_dnode+0x3a/0x60 [zfs]
May 04 18:03:17 srv02 kernel:  zvol_write+0x226/0x680 [zfs]
May 04 18:03:17 srv02 kernel:  zvol_write_task+0x12/0x30 [zfs]
May 04 18:03:17 srv02 kernel:  taskq_thread+0x27f/0x490 [spl]
May 04 18:03:17 srv02 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 04 18:03:17 srv02 kernel:  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
May 04 18:03:17 srv02 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 04 18:03:17 srv02 kernel:  kthread+0xef/0x120
May 04 18:03:17 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:17 srv02 kernel:  ret_from_fork+0x44/0x70
May 04 18:03:17 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:17 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 04 18:03:17 srv02 kernel:  </TASK>
May 04 18:03:17 srv02 kernel: Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 rdma_cm iw_cm ib_cm ib_core cifs_md4 fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel pmt_telemetry i915 mei_hdcp mei_pxp pmt_class kvm drm_buddy irqbypass ttm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel drm_display_helper eeepc_wmi asus_wmi crypto_simd ledtrig_audio cec cryptd sparse_keymap rapl intel_cstate cmdlinepart platform_profile spi_nor rc_core serio_raw wmi_bmof mei_me pcspkr drm_kms_helper mtd mei i2c_algo_bit intel_vsec acpi_pad acpi_tad joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs
May 04 18:03:17 srv02 kernel:  blake2b_generic xor raid6_pq libcrc32c simplefb hid_generic usbmouse usbkbd usbhid hid xhci_pci nvme xhci_pci_renesas crc32_pclmul igc spi_intel_pci i2c_i801 nvme_core xhci_hcd spi_intel i2c_smbus intel_lpss_pci ahci nvme_common intel_lpss libahci idma64 video wmi pinctrl_alderlake
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060
May 04 18:03:17 srv02 kernel: ---[ end trace 0000000000000000 ]---
May 04 18:03:17 srv02 kernel: RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel: Code: 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 8b 1f <48> 81 7b 60 40 92 c6 c0 0f 84 f5 00 00 00 48 8b 33 48 8b 53 08 48
May 04 18:03:17 srv02 kernel: RSP: 0018:ffff9f7e8ddabb90 EFLAGS: 00010286
May 04 18:03:17 srv02 kernel: RAX: ffff9455646eb720 RBX: 0000000000000000 RCX: 0000000000000000
May 04 18:03:17 srv02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9454dfd7c960
May 04 18:03:17 srv02 kernel: RBP: ffff9f7e8ddabbb8 R08: 0000000000000000 R09: 0000000000000000
May 04 18:03:17 srv02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 04 18:03:17 srv02 kernel: R13: ffff9455646eb720 R14: 0000000000000000 R15: ffff9f7e8ddabc48
May 04 18:03:17 srv02 kernel: FS:  0000000000000000(0000) GS:ffff9460bf100000(0000) knlGS:0000000000000000
May 04 18:03:17 srv02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060 CR3: 00000004e2856000 CR4: 0000000000752ee0
May 04 18:03:17 srv02 kernel: PKRU: 55555554
May 04 18:03:17 srv02 kernel: note: zvol[3511] exited with irqs disabled
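
To compare crashes like this one across reboots or hosts, it may help to reduce each log to a short signature. The sketch below is only an illustration in Python with hypothetical helper names. It pulls the source file and function out of the `PANIC at` line and the faulting function out of the `RIP:` line. The line number is deliberately left out because it shifts between ZFS versions.

```python
import re

# "PANIC at arc.c:6610:arc_write_done()" -> file, line, function
PANIC_RE = re.compile(r"PANIC at (?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)")
# "RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]" -> function
RIP_RE = re.compile(r"RIP: \d+:(?P<func>[\w.]+)\+0x")


def panic_signature(log_text):
    """Return (source file, function) of the first SPL panic, or None."""
    match = PANIC_RE.search(log_text)
    return (match["file"], match["func"]) if match else None


def oops_function(log_text):
    """Return the function named in the first RIP line, or None."""
    match = RIP_RE.search(log_text)
    return match["func"] if match else None
```

For the log above this gives `("arc.c", "arc_write_done")` for the panic and `arc_buf_access` for the NULL pointer dereference that follows it.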
 
Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes
#10: Set pcie_aspm=off and pcie_port_pm=off -> still crashes

--> #11: Set intel_idle.max_cstate=0 and processor.max_cstate=1

#12: Set intel_pstate=disable -> pending
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200
 
New crash but without any logs....

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes
#10: Set pcie_aspm=off and pcie_port_pm=off -> still crashes
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1

--> #12: Set intel_pstate=disable
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200 -> pending
 
Another crash after 10 min uptime, again without any logs...
30 min after that, another crash, so now at #14

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)

--> #14: Disable GPU Power Management via i915.enable_dc=0
#15: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline via elevator=mq-deadline -> pending
#16: maybe try mitigations=off -> pending
#17: maybe lower the RAM from DDR5-4400 to DDR5-4200 -> pending
 
Try using an older version of ZFS. Maybe it will help.

I have stayed on kernel 6.2.16-18-pve with ZFS 2.1.13-pve1, and I'm not sure when I'll upgrade to a newer kernel.
 
It's getting worse and worse. I reverted a couple of settings and disabled ASPM completely in the BIOS.

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)

#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
#14: Disable GPU Power Management via i915.enable_dc=0 (still crashes)
#15: /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline (still crashes)

#16: Lower the RAM from DDR5-4400 to DDR5-4200 (still crashes)
-> #17: Revert some of the changes, disable ASPM in the BIOS
#18: maybe try mitigations=off -> pending
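
For step #15, the active scheduler is the bracketed entry in the sysfs file, so the change can be verified per device after each reboot. A small Python sketch with hypothetical helper names; the `/sys/block/<device>/queue/scheduler` path is the one named in the list:

```python
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler the kernel marks with [brackets], e.g. '[none] mq-deadline'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler marked in {sysfs_text!r}")


def nvme_schedulers(sys_block: Path = Path("/sys/block")) -> dict[str, str]:
    """Map each NVMe namespace (nvme0n1, ...) to its active I/O scheduler."""
    result = {}
    for sched_file in sorted(sys_block.glob("nvme*n*/queue/scheduler")):
        device = sched_file.parent.parent.name
        result[device] = active_scheduler(sched_file.read_text())
    return result
```

Checking the sysfs file directly is the safer test here: as far as I know, current kernels ignore the old `elevator=` boot parameter, so a value set that way may never take effect.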
 
At first, I was convinced that the hardware must be okay, because it had already been replaced in March 2024.

In March, I also experienced random crashes, but the server was still running ESXi, which isn't officially supported on this hardware.
So I decided to migrate to Proxmox. I had crashes while installing Proxmox, so the hardware was replaced, and I had no more issues with this server until last week.

Finally, after another complete hardware replacement yesterday, I have had no crashes for more than 16 hours.

So maybe there is an issue with the Asus Pro WS W680 boards, the chipset, and/or the i9-13900 (non-K) that only shows up as the hardware ages.

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
#14: Disable GPU Power Management via i915.enable_dc=0 (still crashes)
#15: /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline (still crashes)

#16: Lower the RAM from DDR5-4400 to DDR5-4200 (still crashes)
#17: Revert some of the changes, disable ASPM in the BIOS (still crashes)
-> #18: Let Hetzner change the complete hardware, revert most of the changes <- working

BIOS is 2008, the intel-microcode package is the standard one from Debian stable, and the kernel is now 6.8.4-2-pve with the originally used cmdline:
Code:
pcie_aspm.policy=performance split_lock_detect=off
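
Since the working state is defined by kernel, microcode and cmdline together, it can be handy to record all three in one go and diff them later if crashes return. The following is a rough Python sketch under the assumption of a Linux x86 host; `baseline` and `microcode_revisions` are hypothetical names, while `/proc/cpuinfo` and `/proc/cmdline` are the standard kernel files:

```python
import platform
import re
from pathlib import Path


def microcode_revisions(cpuinfo_text: str) -> set[int]:
    """Collect the distinct 'microcode' values listed in /proc/cpuinfo."""
    found = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, flags=re.MULTILINE
    )
    return {int(value, 16) for value in found}


def baseline() -> dict:
    """Snapshot of the running host: kernel release, cmdline tokens, microcode."""
    cpuinfo = Path("/proc/cpuinfo").read_text()
    return {
        "kernel": platform.release(),
        "cmdline": Path("/proc/cmdline").read_text().split(),
        "microcode": sorted(hex(rev) for rev in microcode_revisions(cpuinfo)),
    }
```

Revisions are compared as integers, so the `0x00000122` from the list above and a shorter `0x122` count as the same value.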
 
@ksb, glad to see it appears to be working.

Did you find a benefit to using cache=none, aio=threads and iothread=1 despite the change in hardware?
 
No, I didn't find a benefit. I think because it is NVMe only, it doesn't make a huge difference in my environment.
 
Fantastic troubleshooting thread! I am using an Asus Pro WS W680-ACE board with an Intel i9-14900K. Everything was rock solid for the first 3-4+ months, but now I am getting random segmentation faults and kernel panics. I have ruled out RAM (memtest); it appears to be CPU related.

When you swapped out the hardware, what did you move to?
 
Hetzner replaced the whole server (1:1 with the same components).
 
Hello fellow Intel 13900 user. I already had a similar problem on one server on November 27th 2024 (https://forum.proxmox.com/threads/k...m_cache_alloc-0x37b-0x380.158134/#post-727767). On December 11th 2024 I got another problem on a different server with the same hardware (Hetzner EX101, Intel i9-13900, ASRockRack W680D4U-1L, ECC memory, NVMe drives).

The host could not write the logs to `/var/log/syslog`, but it did send them successfully to NewRelic (log server etc.).

I got a very similar crash here (same PANIC at arc.c:XXXX:arc_write_done()):

Code:
2024-12-11T18:04:20.670628+01:00 13900HostFsn kernel: [4390751.721029] PANIC at arc.c:6622:arc_write_done()
2024-12-11T18:04:20.670629+01:00 13900HostFsn kernel: [4390751.721031] Showing stack for process 3503530
2024-12-11T18:04:20.670629+01:00 13900HostFsn kernel: [4390751.721032] CPU: 8 PID: 3503530 Comm: z_wr_int_4 Tainted: P    B   W  O       6.8.12-2-pve #1
2024-12-11T18:04:20.670630+01:00 13900HostFsn kernel: [4390751.721034] Hardware name: Hetzner /W680D4U-1L, BIOS 10.28 03/10/2023
2024-12-11T18:04:20.670630+01:00 13900HostFsn kernel: [4390751.721036] Call Trace:
2024-12-11T18:04:20.670630+01:00 13900HostFsn kernel: [4390751.721037]  <TASK>
2024-12-11T18:04:20.670631+01:00 13900HostFsn kernel: [4390751.721040]  dump_stack_lvl+0x76/0xa0
2024-12-11T18:04:20.670631+01:00 13900HostFsn kernel: [4390751.721046]  dump_stack+0x10/0x20
2024-12-11T18:04:20.670632+01:00 13900HostFsn kernel: [4390751.721049]  spl_dumpstack+0x29/0x40 [spl]
2024-12-11T18:04:20.670632+01:00 13900HostFsn kernel: [4390751.721057]  spl_panic+0xfc/0x120 [spl]
2024-12-11T18:04:20.670633+01:00 13900HostFsn kernel: [4390751.721065]  arc_write_done+0x44f/0x550 [zfs]
2024-12-11T18:04:20.670633+01:00 13900HostFsn kernel: [4390751.721192]  zio_done+0x289/0x10b0 [zfs]
2024-12-11T18:04:20.670633+01:00 13900HostFsn kernel: [4390751.721307]  zio_execute+0x88/0x130 [zfs]
2024-12-11T18:04:20.670649+01:00 13900HostFsn kernel: [4390751.721434]  taskq_thread+0x27f/0x4c0 [spl]
2024-12-11T18:04:20.670650+01:00 13900HostFsn kernel: [4390751.721445]  ? __pfx_default_wake_function+0x10/0x10
2024-12-11T18:04:20.670650+01:00 13900HostFsn kernel: [4390751.721452]  ? __pfx_zio_execute+0x10/0x10 [zfs]
2024-12-11T18:04:20.670650+01:00 13900HostFsn kernel: [4390751.721597]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:04:20.670650+01:00 13900HostFsn kernel: [4390751.721609]  kthread+0xef/0x120
2024-12-11T18:04:20.670651+01:00 13900HostFsn kernel: [4390751.721613]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:04:20.670660+01:00 13900HostFsn kernel: [4390751.721615]  ret_from_fork+0x44/0x70
2024-12-11T18:04:20.670660+01:00 13900HostFsn kernel: [4390751.721619]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:04:20.670661+01:00 13900HostFsn kernel: [4390751.721621]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:04:20.670661+01:00 13900HostFsn kernel: [4390751.721624]  </TASK>
2024-12-11T18:04:20.670611+01:00 13900HostFsn kernel: [4390751.721023] VERIFY3(remove_reference(hdr, hdr) > 0) failed (0 > 0)
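
Note that the export from the log server is not in order: the `VERIFY3` line carries the earliest kernel timestamp but appears last. A minimal Python sketch for restoring the order by the bracketed kernel timestamp follows; the function name is hypothetical, and lines without a timestamp simply stay attached to the line before them:

```python
import re

# Matches the monotonic kernel timestamp, e.g. "kernel: [4390751.721023]"
KTIME_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def sort_by_kernel_time(lines):
    """Return the lines ordered by kernel timestamp, ties kept in input order."""
    keyed = []
    last_seen = float("-inf")
    for index, line in enumerate(lines):
        match = KTIME_RE.search(line)
        if match:
            last_seen = float(match.group(1))
        # The index breaks ties, so equal timestamps keep their original order.
        keyed.append((last_seen, index, line))
    return [line for _, _, line in sorted(keyed)]
```

Sorting on the kernel timestamp rather than the syslog timestamp seems the safer choice here, since several lines share the same syslog microsecond.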

The system is PVE 8.2.7 with kernel 6.8.12-2-pve. That is the version I see after the reboot, and since the trace header above also reports 6.8.12-2-pve, the same kernel was most likely running when the crash happened.

These are the crash-related logs that followed:

Code:
2024-12-11T18:05:02.325363+01:00 13900HostFsn vzdump[3504053]: INFO: starting new backup job: vzdump 101 --storage local101 --mode snapshot --mailto XXX@YYY.de --quiet 1 --mailnotification failure
2024-12-11T18:05:02.277457+01:00 13900HostFsn vzdump[3504052]: <root@pam> starting task UPID:13900HostFsn:003577B5:1A2BE590:6759C63E:vzdump:101:root@pam:
[...]
2024-12-11T18:07:33.185495+01:00 13900HostFsn kernel: [4390944.233463]  dmu_tx_count_write+0xe2/0x1d0 [zfs]
2024-12-11T18:07:33.185495+01:00 13900HostFsn kernel: [4390944.233588]  ? dmu_tx_hold_dnode_impl+0x57/0x130 [zfs]
2024-12-11T18:07:33.185495+01:00 13900HostFsn kernel: [4390944.233713]  dmu_tx_hold_write_by_dnode+0x3a/0x60 [zfs]
2024-12-11T18:07:33.185495+01:00 13900HostFsn kernel: [4390944.233840]  zvol_write+0x223/0x670 [zfs]
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.233945]  zvol_write_task+0x12/0x30 [zfs]
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.234050]  taskq_thread+0x27f/0x4c0 [spl]
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.234060]  ? __pfx_default_wake_function+0x10/0x10
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.234063]  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.234167]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234177]  kthread+0xef/0x120
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234179]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234181]  ret_from_fork+0x44/0x70
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234183]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234185]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234187]  </TASK>
2024-12-11T18:07:33.185494+01:00 13900HostFsn kernel: [4390944.233337]  dmu_tx_check_ioerr+0x61/0x110 [zfs]
2024-12-11T18:07:33.184541+01:00 13900HostFsn kernel: [4390944.232430]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:07:33.184541+01:00 13900HostFsn kernel: [4390944.232439]  kthread+0xef/0x120
2024-12-11T18:07:33.184541+01:00 13900HostFsn kernel: [4390944.232441]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.184542+01:00 13900HostFsn kernel: [4390944.232443]  ret_from_fork+0x44/0x70
2024-12-11T18:07:33.184542+01:00 13900HostFsn kernel: [4390944.232446]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.184542+01:00 13900HostFsn kernel: [4390944.232448]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:07:33.184542+01:00 13900HostFsn kernel: [4390944.232450]  </TASK>
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232454] INFO: task z_wr_int_4:3503530 blocked for more than 122 seconds.
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232455]       Tainted: P    B   W  O       6.8.12-2-pve #1
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232457] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232459] task:z_wr_int_4      state:D stack:0     pid:3503530 tgid:3503530 ppid:2      flags:0x00004000
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232461] Call Trace:
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232463]  <TASK>
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232464]  __schedule+0x401/0x15e0
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232467]  schedule+0x33/0x110
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232469]  spl_panic+0x112/0x120 [spl]
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232480]  arc_write_done+0x44f/0x550 [zfs]
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232594]  zio_done+0x289/0x10b0 [zfs]
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232706]  zio_execute+0x88/0x130 [zfs]
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232819]  taskq_thread+0x27f/0x4c0 [spl]
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232828]  ? __pfx_default_wake_function+0x10/0x10
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232831]  ? __pfx_zio_execute+0x10/0x10 [zfs]
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232944]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232953]  kthread+0xef/0x120
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232955]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232957]  ret_from_fork+0x44/0x70
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232959]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232961]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232963]  </TASK>
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232965] INFO: task zvol_tq-2:3503540 blocked for more than 122 seconds.
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232966]       Tainted: P    B   W  O       6.8.12-2-pve #1
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232968] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232969] task:zvol_tq-2       state:D stack:0     pid:3503540 tgid:3503540 ppid:2      flags:0x00004000
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232972] Call Trace:
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232973]  <TASK>
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232974]  __schedule+0x401/0x15e0
2024-12-11T18:07:33.184552+01:00 13900HostFsn kernel: [4390944.232977]  schedule+0x33/0x110
2024-12-11T18:07:33.184552+01:00 13900HostFsn kernel: [4390944.232980]  schedule_preempt_disabled+0x15/0x30
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.232982]  __mutex_lock.constprop.0+0x3f8/0x7a0
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.232985]  __mutex_lock_slowpath+0x13/0x20
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.232987]  mutex_lock+0x3c/0x50
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.232989]  arc_buf_access+0x6f/0x1c0 [zfs]
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.233104]  dbuf_hold_impl+0x9a/0x730 [zfs]
2024-12-11T18:07:33.184554+01:00 13900HostFsn kernel: [4390944.233223]  ? zio_create+0x3e8/0x660 [zfs]
2024-12-11T18:07:33.184540+01:00 13900HostFsn kernel: [4390944.232316]  ? __pfx_zio_execute+0x10/0x10 [zfs]
2024-12-11T18:07:33.183585+01:00 13900HostFsn kernel: [4390944.231453]  zio_done+0x289/0x10b0 [zfs]
2024-12-11T18:07:33.183586+01:00 13900HostFsn kernel: [4390944.231566]  zio_execute+0x88/0x130 [zfs]
2024-12-11T18:07:33.183586+01:00 13900HostFsn kernel: [4390944.231678]  taskq_thread+0x27f/0x4c0 [spl]
2024-12-11T18:07:33.183587+01:00 13900HostFsn kernel: [4390944.231688]  ? __pfx_default_wake_function+0x10/0x10
2024-12-11T18:07:33.183587+01:00 13900HostFsn kernel: [4390944.231691]  ? __pfx_zio_execute+0x10/0x10 [zfs]
2024-12-11T18:07:33.183587+01:00 13900HostFsn kernel: [4390944.231804]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:07:33.183587+01:00 13900HostFsn kernel: [4390944.231813]  kthread+0xef/0x120
2024-12-11T18:07:33.183588+01:00 13900HostFsn kernel: [4390944.231815]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.183588+01:00 13900HostFsn kernel: [4390944.231817]  ret_from_fork+0x44/0x70
2024-12-11T18:07:33.183588+01:00 13900HostFsn kernel: [4390944.231820]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.183588+01:00 13900HostFsn kernel: [4390944.231821]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:07:33.183589+01:00 13900HostFsn kernel: [4390944.231824]  </TASK>
2024-12-11T18:07:33.183589+01:00 13900HostFsn kernel: [4390944.231826] INFO: task z_wr_int_4:3503235 blocked for more than 122 seconds.
2024-12-11T18:07:33.183589+01:00 13900HostFsn kernel: [4390944.231828]       Tainted: P    B   W  O       6.8.12-2-pve #1
2024-12-11T18:07:33.183589+01:00 13900HostFsn kernel: [4390944.231829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-12-11T18:07:33.183590+01:00 13900HostFsn kernel: [4390944.231831] task:z_wr_int_4      state:D stack:0     pid:3503235 tgid:3503235 ppid:2      flags:0x00004000
[...]
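
With this many hung-task reports it is easier to see which threads are stuck by listing them. The sketch below is an illustration in Python with a hypothetical function name; the pattern is taken from the `INFO: task ... blocked for more than ... seconds` lines in the log above:

```python
import re

HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text):
    """Return (task name, pid) for every hung-task report, in log order."""
    return [(m["comm"], int(m["pid"])) for m in HUNG_RE.finditer(log_text)]
```

In the excerpt above this lists `z_wr_int_4` (pid 3503530, the thread that hit the panic) and `zvol_tq-2`, whose trace shows it waiting on a mutex in `arc_buf_access`.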

edit: attached full logs


Anyhow, since this is my production server, I will no longer rely on Intel 13900 CPUs and will switch to AMD for a while...
 
