Random freezes, maybe ZFS related

ksb · May 4, 2024

@rj45, why should I remove split_lock_detect=off?
I added it because I was getting a lot of split lock detections: https://forum.proxmox.com/threads/x86-split-lock-detection.111544/

What about also setting pcie_port_pm to off?

Code:

pcie_port_pm=off

ksb · May 4, 2024

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes
--> #10: Set pcie_aspm=off and pcie_port_pm=off
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 -> pending
#12: Set intel_pstate=disable -> pending
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200
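
With this many boot parameters in play, it helps to confirm after each reboot that the running kernel actually received them. A minimal sketch in Python, assuming the standard `/proc/cmdline` interface; the helper names `parse_cmdline` and `missing_params` are made up for this illustration, and `EXPECTED` only holds a few of the values from the list above:

```python
import shlex


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(text):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(active, expected):
    """Return the expected parameters that are absent or set to another value."""
    return {
        name: value
        for name, value in expected.items()
        if name not in active or active[name] != value
    }


# Hypothetical selection from the list above; adjust to whatever is being tested.
EXPECTED = {
    "pcie_aspm": "off",
    "pcie_port_pm": "off",
    "split_lock_detect": "off",
}
```

Passing the contents of `/proc/cmdline` to `parse_cmdline` and the result to `missing_params(active, EXPECTED)` returns an empty dict when every listed parameter is active, which rules out a forgotten bootloader refresh as the reason a setting had no effect.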

ksb · May 4, 2024

Another crash

Code:

May 04 18:03:03 srv02 kernel: VERIFY3(remove_reference(hdr, hdr) > 0) failed (0 > 0)
May 04 18:03:03 srv02 kernel: PANIC at arc.c:6610:arc_write_done()
May 04 18:03:03 srv02 kernel: Showing stack for process 785
May 04 18:03:03 srv02 kernel: CPU: 28 PID: 785 Comm: z_wr_int_2 Tainted: P           O       6.5.13-5-pve #1
May 04 18:03:03 srv02 kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/W680/MB DC, BIOS 2008 03/20/2024
May 04 18:03:03 srv02 kernel: Call Trace:
May 04 18:03:03 srv02 kernel:  <TASK>
May 04 18:03:03 srv02 kernel:  dump_stack_lvl+0x48/0x70
May 04 18:03:03 srv02 kernel:  dump_stack+0x10/0x20
May 04 18:03:03 srv02 kernel:  spl_dumpstack+0x29/0x40 [spl]
May 04 18:03:03 srv02 kernel:  spl_panic+0xfc/0x120 [spl]
May 04 18:03:03 srv02 kernel:  arc_write_done+0x44f/0x550 [zfs]
May 04 18:03:03 srv02 kernel:  ? mutex_lock+0x12/0x50
May 04 18:03:03 srv02 kernel:  zio_done+0x289/0x10b0 [zfs]
May 04 18:03:03 srv02 kernel:  ? kfree+0x78/0x120
May 04 18:03:03 srv02 kernel:  zio_execute+0x88/0x130 [zfs]
May 04 18:03:03 srv02 kernel:  taskq_thread+0x27f/0x490 [spl]
May 04 18:03:03 srv02 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 04 18:03:03 srv02 kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
May 04 18:03:03 srv02 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 04 18:03:03 srv02 kernel:  kthread+0xef/0x120
May 04 18:03:03 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:03 srv02 kernel:  ret_from_fork+0x44/0x70
May 04 18:03:03 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:03 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 04 18:03:03 srv02 kernel:  </TASK>
May 04 18:03:17 srv02 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000060
May 04 18:03:17 srv02 kernel: #PF: supervisor read access in kernel mode
May 04 18:03:17 srv02 kernel: #PF: error_code(0x0000) - not-present page
May 04 18:03:17 srv02 kernel: PGD 0 P4D 0
May 04 18:03:17 srv02 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
May 04 18:03:17 srv02 kernel: CPU: 4 PID: 3511 Comm: zvol Tainted: P           O       6.5.13-5-pve #1
May 04 18:03:17 srv02 kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/W680/MB DC, BIOS 2008 03/20/2024
May 04 18:03:17 srv02 kernel: RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel: Code: 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 8b 1f <48> 81 7b 60 40 92 c6 c0 0f 84 f5 00 00 00 48 8b 33 48 8b 53 08 48
May 04 18:03:17 srv02 kernel: RSP: 0018:ffff9f7e8ddabb90 EFLAGS: 00010286
May 04 18:03:17 srv02 kernel: RAX: ffff9455646eb720 RBX: 0000000000000000 RCX: 0000000000000000
May 04 18:03:17 srv02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9454dfd7c960
May 04 18:03:17 srv02 kernel: RBP: ffff9f7e8ddabbb8 R08: 0000000000000000 R09: 0000000000000000
May 04 18:03:17 srv02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 04 18:03:17 srv02 kernel: R13: ffff9455646eb720 R14: 0000000000000000 R15: ffff9f7e8ddabc48
May 04 18:03:17 srv02 kernel: FS:  0000000000000000(0000) GS:ffff9460bf100000(0000) knlGS:0000000000000000
May 04 18:03:17 srv02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060 CR3: 00000004e2856000 CR4: 0000000000752ee0
May 04 18:03:17 srv02 kernel: PKRU: 55555554
May 04 18:03:17 srv02 kernel: Call Trace:
May 04 18:03:17 srv02 kernel:  <TASK>
May 04 18:03:17 srv02 kernel:  ? show_regs+0x6d/0x80
May 04 18:03:17 srv02 kernel:  ? __die+0x24/0x80
May 04 18:03:17 srv02 kernel:  ? page_fault_oops+0x176/0x500
May 04 18:03:17 srv02 kernel:  ? do_user_addr_fault+0x31d/0x6a0
May 04 18:03:17 srv02 kernel:  ? exc_page_fault+0x83/0x1b0
May 04 18:03:17 srv02 kernel:  ? asm_exc_page_fault+0x27/0x30
May 04 18:03:17 srv02 kernel:  ? arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel:  dbuf_hold_impl+0x9a/0x730 [zfs]
May 04 18:03:17 srv02 kernel:  ? zio_create+0x3e8/0x660 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_check_ioerr+0x61/0x110 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_count_write+0xe2/0x1d0 [zfs]
May 04 18:03:17 srv02 kernel:  ? dmu_tx_hold_dnode_impl+0x57/0x130 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_hold_write_by_dnode+0x3a/0x60 [zfs]
May 04 18:03:17 srv02 kernel:  zvol_write+0x226/0x680 [zfs]
May 04 18:03:17 srv02 kernel:  zvol_write_task+0x12/0x30 [zfs]
May 04 18:03:17 srv02 kernel:  taskq_thread+0x27f/0x490 [spl]
May 04 18:03:17 srv02 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 04 18:03:17 srv02 kernel:  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
May 04 18:03:17 srv02 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 04 18:03:17 srv02 kernel:  kthread+0xef/0x120
May 04 18:03:17 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:17 srv02 kernel:  ret_from_fork+0x44/0x70
May 04 18:03:17 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:17 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 04 18:03:17 srv02 kernel:  </TASK>
May 04 18:03:17 srv02 kernel: Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 rdma_cm iw_cm ib_cm ib_core cifs_md4 fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel pmt_telemetry i915 mei_hdcp mei_pxp pmt_class kvm drm_buddy irqbypass ttm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel drm_display_helper eeepc_wmi asus_wmi crypto_simd ledtrig_audio cec cryptd sparse_keymap rapl intel_cstate cmdlinepart platform_profile spi_nor rc_core serio_raw wmi_bmof mei_me pcspkr drm_kms_helper mtd mei i2c_algo_bit intel_vsec acpi_pad acpi_tad joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs
May 04 18:03:17 srv02 kernel:  blake2b_generic xor raid6_pq libcrc32c simplefb hid_generic usbmouse usbkbd usbhid hid xhci_pci nvme xhci_pci_renesas crc32_pclmul igc spi_intel_pci i2c_i801 nvme_core xhci_hcd spi_intel i2c_smbus intel_lpss_pci ahci nvme_common intel_lpss libahci idma64 video wmi pinctrl_alderlake
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060
May 04 18:03:17 srv02 kernel: ---[ end trace 0000000000000000 ]---
May 04 18:03:17 srv02 kernel: RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel: Code: 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 8b 1f <48> 81 7b 60 40 92 c6 c0 0f 84 f5 00 00 00 48 8b 33 48 8b 53 08 48
May 04 18:03:17 srv02 kernel: RSP: 0018:ffff9f7e8ddabb90 EFLAGS: 00010286
May 04 18:03:17 srv02 kernel: RAX: ffff9455646eb720 RBX: 0000000000000000 RCX: 0000000000000000
May 04 18:03:17 srv02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9454dfd7c960
May 04 18:03:17 srv02 kernel: RBP: ffff9f7e8ddabbb8 R08: 0000000000000000 R09: 0000000000000000
May 04 18:03:17 srv02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 04 18:03:17 srv02 kernel: R13: ffff9455646eb720 R14: 0000000000000000 R15: ffff9f7e8ddabc48
May 04 18:03:17 srv02 kernel: FS:  0000000000000000(0000) GS:ffff9460bf100000(0000) knlGS:0000000000000000
May 04 18:03:17 srv02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060 CR3: 00000004e2856000 CR4: 0000000000752ee0
May 04 18:03:17 srv02 kernel: PKRU: 55555554
May 04 18:03:17 srv02 kernel: note: zvol[3511] exited with irqs disabled
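
For long traces like this one, it can help to boil the log down to the few lines that name the failure. A rough sketch in Python; the function name `summarize_crash` and the pattern list are illustrative choices, not part of any tool, and the patterns only cover the markers visible in the log above:

```python
import re

# Markers that identify the interesting lines in the crash log above.
KEY_PATTERNS = [
    re.compile(r"VERIFY\d*\(.*\) failed"),  # failed ZFS assertion
    re.compile(r"PANIC at \S+"),            # source location of the SPL panic
    re.compile(r"BUG: .*"),                 # kernel BUG line
    re.compile(r"RIP: \S+"),                # faulting instruction pointer
]


def summarize_crash(lines):
    """Return the distinct key fragments in order of first appearance."""
    seen = []
    for line in lines:
        for pattern in KEY_PATTERNS:
            match = pattern.search(line)
            if match and match.group(0) not in seen:
                seen.append(match.group(0))
    return seen
```

Run over the trace above, this yields the VERIFY3 assertion, the PANIC location in arc.c, the NULL pointer dereference and the faulting `arc_buf_access` RIP, in that order. The input can be any iterable of log lines, for example the saved output of `journalctl -k` for the boot in question.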

ksb · May 4, 2024

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes
#10: Set pcie_aspm=off and pcie_port_pm=off -> still crashes
--> #11: Set intel_idle.max_cstate=0 and processor.max_cstate=1
#12: Set intel_pstate=disable -> pending
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200

ksb · May 4, 2024

New crash, but without any logs.

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes
#10: Set pcie_aspm=off and pcie_port_pm=off -> still crashes
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1
--> #12: Set intel_pstate=disable
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200 -> pending

ksb · May 4, 2024

Another crash after 10 min uptime, again without any logs.
30 min after that, another crash, so I am now at #14.

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
--> #14: Disable GPU Power Management via i915.enable_dc=0
#15: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline via elevator=mq-deadline -> pending
#16: maybe try mitigations=off -> pending
#17: maybe lower the RAM from DDR5-4400 to DDR5-4200 -> pending
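
For the scheduler item, the sysfs file lists all available schedulers on one line and marks the active one in square brackets, so it is easy to check what is really in use. As far as I know, recent kernels ignore the `elevator=` boot parameter, so reading sysfs is the more reliable check. A small Python sketch; the function names are hypothetical:

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets], e.g. 'none' for '[none] mq-deadline'."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def available_schedulers(sysfs_text):
    """Return every scheduler name on the line, active or not."""
    return sysfs_text.replace("[", "").replace("]", "").split()
```

The text to pass in is the content of `/sys/block/nvmeXn1/queue/scheduler` for each NVMe device, read after the change has been applied.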

Nemesiz · May 5, 2024

Try using an older version of ZFS. Maybe it will help.

I have stuck with kernel 6.2.16-18-pve and ZFS 2.1.13-pve1, and I'm not sure when I'll upgrade to a newer kernel.

ksb · May 5, 2024

It's getting worse and worse. I reverted a couple of settings and disabled ASPM in the BIOS completely.

Another crash after 10 min uptime, again without any logs.
30 min after that, another crash, so now at #14

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
~~#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)~~
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
#14: Disable GPU Power Management via i915.enable_dc=0 (still crashes)
#15: /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline (still crashes)
#16: Lower the RAM from DDR5-4400 to DDR5-4200 (still crashes)
-> #17: Revert some of the changes, disable ASPM in the BIOS
#18: maybe try mitigations=off -> pending

ksb · May 6, 2024

At first, I was convinced that the hardware had to be okay, because it had already been replaced in March 2024.

In March, I also experienced random crashes, but the server was still running ESXi, which isn't officially supported on this hardware.
So, I decided to migrate to Proxmox. I had crashes while installing Proxmox, so the hardware was changed, and I had no more issues with this server until last week.

Finally, after another complete hardware change yesterday, I have had no crashes for more than 16 hours.

So maybe there is an issue with the Asus Pro WS W680 boards, with the chipset and/or with the i9-13900 (non-K) as the hardware ages.

Status update:

~~#1: Disabled ARC using primarycache=none -> still crashes~~
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
~~#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)~~
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
#14: Disable GPU Power Management via i915.enable_dc=0 (still crashes)
#15: /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline (still crashes)
#16: Lower the RAM from DDR5-4400 to DDR5-4200 (still crashes)
#17: Revert some of the changes, disable ASPM in the BIOS (still crashes)
-> #18: Let Hetzner change the complete hardware, revert most of the changes <- working

The BIOS is 2008, intel-microcode is the standard package from Debian stable, and the kernel is now 6.8.4-2-pve with the cmdline I used initially:

Code:

pcie_aspm.policy=performance split_lock_detect=off

benyamin · May 8, 2024

@ksb, glad to see it appears to be working.

Did you find a benefit to using cache=none, aio=threads and iothread=1 despite the change in hardware?

ksb · May 8, 2024

benyamin said:
@ksb, glad to see it appears to be working.

Did you find a benefit to using cache=none, aio=threads and iothread=1 despite the change in hardware?

No, I didn't find a benefit. I think that because the storage is NVMe only, it doesn't make a huge difference in my environment.

jpiszcz · Jul 6, 2024

ksb said:
At first, I was convinced that the hardware had to be okay, because it had already been replaced in March 2024.

In March, I also experienced random crashes, but the server was still running ESXi, which isn't officially supported on this hardware.
So, I decided to migrate to Proxmox. I had crashes while installing Proxmox, so the hardware was changed, and I had no more issues with this server until last week.

Finally, after another complete hardware change yesterday, I have had no crashes for more than 16 hours.

So maybe there is an issue with the Asus Pro WS W680 boards, with the chipset and/or with the i9-13900 (non-K) as the hardware ages.

Status update:

~~#1: Disabled ARC using primarycache=none -> still crashes~~
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
~~#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)~~
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
#14: Disable GPU Power Management via i915.enable_dc=0 (still crashes)
#15: /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline (still crashes)
#16: Lower the RAM from DDR5-4400 to DDR5-4200 (still crashes)
#17: Revert some of the changes, disable ASPM in the BIOS (still crashes)
-> #18: Let Hetzner change the complete hardware, revert most of the changes <- working

The BIOS is 2008, intel-microcode is the standard package from Debian stable, and the kernel is now 6.8.4-2-pve with the cmdline I used initially:

Code:

pcie_aspm.policy=performance split_lock_detect=off

Fantastic troubleshooting thread! I am using an Asus Pro WS W680-ACE board with an Intel i9-14900K. Everything was rock solid for the first 3-4+ months, and now I am getting random segmentation faults and kernel panics. I have ruled out RAM with memtest; it appears to be CPU related.

When you swapped out the hardware, what did you move to?

ksb · Jul 7, 2024

jpiszcz said:
Fantastic troubleshooting thread! I am using an Asus Pro WS W680-ACE board with an Intel i9-14900K. Everything was rock solid for the first 3-4+ months, and now I am getting random segmentation faults and kernel panics. I have ruled out RAM with memtest; it appears to be CPU related.

When you swapped out the hardware, what did you move to?

Hetzner replaced the whole server (1:1 with the same components).

logics · Dec 15, 2024

Hello, fellow Intel 13900 user. I already had a similar problem on one server on November 27th, 2024 (https://forum.proxmox.com/threads/k...m_cache_alloc-0x37b-0x380.158134/#post-727767). Now I have another problem on a different server with the same hardware (Hetzner EX101, Intel i9-13900, ASRockRack W680D4U-1L, ECC memory, NVMe drives), on December 11th, 2024.

The server could not write to `/var/log/syslog`, but the logs were sent successfully to NewRelic (log server etc.).

I got a very similar crash here (same PANIC at arc.c:XXXX:arc_write_done()):

Code:

2024-12-11T18:04:20.670611+01:00 13900HostFsn kernel: [4390751.721023] VERIFY3(remove_reference(hdr, hdr) > 0) failed (0 > 0)
2024-12-11T18:04:20.670628+01:00 13900HostFsn kernel: [4390751.721029] PANIC at arc.c:6622:arc_write_done()
2024-12-11T18:04:20.670629+01:00 13900HostFsn kernel: [4390751.721031] Showing stack for process 3503530
2024-12-11T18:04:20.670629+01:00 13900HostFsn kernel: [4390751.721032] CPU: 8 PID: 3503530 Comm: z_wr_int_4 Tainted: P    B   W  O       6.8.12-2-pve #1
2024-12-11T18:04:20.670630+01:00 13900HostFsn kernel: [4390751.721034] Hardware name: Hetzner /W680D4U-1L, BIOS 10.28 03/10/2023
2024-12-11T18:04:20.670630+01:00 13900HostFsn kernel: [4390751.721036] Call Trace:
2024-12-11T18:04:20.670630+01:00 13900HostFsn kernel: [4390751.721037]  <TASK>
2024-12-11T18:04:20.670631+01:00 13900HostFsn kernel: [4390751.721040]  dump_stack_lvl+0x76/0xa0
2024-12-11T18:04:20.670631+01:00 13900HostFsn kernel: [4390751.721046]  dump_stack+0x10/0x20
2024-12-11T18:04:20.670632+01:00 13900HostFsn kernel: [4390751.721049]  spl_dumpstack+0x29/0x40 [spl]
2024-12-11T18:04:20.670632+01:00 13900HostFsn kernel: [4390751.721057]  spl_panic+0xfc/0x120 [spl]
2024-12-11T18:04:20.670633+01:00 13900HostFsn kernel: [4390751.721065]  arc_write_done+0x44f/0x550 [zfs]
2024-12-11T18:04:20.670633+01:00 13900HostFsn kernel: [4390751.721192]  zio_done+0x289/0x10b0 [zfs]
2024-12-11T18:04:20.670633+01:00 13900HostFsn kernel: [4390751.721307]  zio_execute+0x88/0x130 [zfs]
2024-12-11T18:04:20.670649+01:00 13900HostFsn kernel: [4390751.721434]  taskq_thread+0x27f/0x4c0 [spl]
2024-12-11T18:04:20.670650+01:00 13900HostFsn kernel: [4390751.721445]  ? __pfx_default_wake_function+0x10/0x10
2024-12-11T18:04:20.670650+01:00 13900HostFsn kernel: [4390751.721452]  ? __pfx_zio_execute+0x10/0x10 [zfs]
2024-12-11T18:04:20.670650+01:00 13900HostFsn kernel: [4390751.721597]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:04:20.670650+01:00 13900HostFsn kernel: [4390751.721609]  kthread+0xef/0x120
2024-12-11T18:04:20.670651+01:00 13900HostFsn kernel: [4390751.721613]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:04:20.670660+01:00 13900HostFsn kernel: [4390751.721615]  ret_from_fork+0x44/0x70
2024-12-11T18:04:20.670660+01:00 13900HostFsn kernel: [4390751.721619]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:04:20.670661+01:00 13900HostFsn kernel: [4390751.721621]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:04:20.670661+01:00 13900HostFsn kernel: [4390751.721624]  </TASK>

The system is PVE 8.2.7 (kernel 6.8.12-2-pve, but that is the kernel version after the reboot; I don't know whether an older kernel was loaded before the reboot).

The following logs are also related to the crash:

Code:

2024-12-11T18:05:02.325363+01:00 13900HostFsn vzdump[3504053]: INFO: starting new backup job: vzdump 101 --storage local101 --mode snapshot --mailto XXX@YYY.de --quiet 1 --mailnotification failure
2024-12-11T18:05:02.277457+01:00 13900HostFsn vzdump[3504052]: <root@pam> starting task UPID:13900HostFsn:003577B5:1A2BE590:6759C63E:vzdump:101:root@pam:
[...]
2024-12-11T18:07:33.185495+01:00 13900HostFsn kernel: [4390944.233463]  dmu_tx_count_write+0xe2/0x1d0 [zfs]
2024-12-11T18:07:33.185495+01:00 13900HostFsn kernel: [4390944.233588]  ? dmu_tx_hold_dnode_impl+0x57/0x130 [zfs]
2024-12-11T18:07:33.185495+01:00 13900HostFsn kernel: [4390944.233713]  dmu_tx_hold_write_by_dnode+0x3a/0x60 [zfs]
2024-12-11T18:07:33.185495+01:00 13900HostFsn kernel: [4390944.233840]  zvol_write+0x223/0x670 [zfs]
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.233945]  zvol_write_task+0x12/0x30 [zfs]
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.234050]  taskq_thread+0x27f/0x4c0 [spl]
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.234060]  ? __pfx_default_wake_function+0x10/0x10
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.234063]  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
2024-12-11T18:07:33.185496+01:00 13900HostFsn kernel: [4390944.234167]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234177]  kthread+0xef/0x120
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234179]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234181]  ret_from_fork+0x44/0x70
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234183]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234185]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:07:33.185497+01:00 13900HostFsn kernel: [4390944.234187]  </TASK>
2024-12-11T18:07:33.185494+01:00 13900HostFsn kernel: [4390944.233337]  dmu_tx_check_ioerr+0x61/0x110 [zfs]
2024-12-11T18:07:33.184541+01:00 13900HostFsn kernel: [4390944.232430]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:07:33.184541+01:00 13900HostFsn kernel: [4390944.232439]  kthread+0xef/0x120
2024-12-11T18:07:33.184541+01:00 13900HostFsn kernel: [4390944.232441]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.184542+01:00 13900HostFsn kernel: [4390944.232443]  ret_from_fork+0x44/0x70
2024-12-11T18:07:33.184542+01:00 13900HostFsn kernel: [4390944.232446]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.184542+01:00 13900HostFsn kernel: [4390944.232448]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:07:33.184542+01:00 13900HostFsn kernel: [4390944.232450]  </TASK>
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232454] INFO: task z_wr_int_4:3503530 blocked for more than 122 seconds.
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232455]       Tainted: P    B   W  O       6.8.12-2-pve #1
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232457] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232459] task:z_wr_int_4      state:D stack:0     pid:3503530 tgid:3503530 ppid:2      flags:0x00004000
2024-12-11T18:07:33.184543+01:00 13900HostFsn kernel: [4390944.232461] Call Trace:
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232463]  <TASK>
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232464]  __schedule+0x401/0x15e0
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232467]  schedule+0x33/0x110
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232469]  spl_panic+0x112/0x120 [spl]
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232480]  arc_write_done+0x44f/0x550 [zfs]
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232594]  zio_done+0x289/0x10b0 [zfs]
2024-12-11T18:07:33.184544+01:00 13900HostFsn kernel: [4390944.232706]  zio_execute+0x88/0x130 [zfs]
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232819]  taskq_thread+0x27f/0x4c0 [spl]
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232828]  ? __pfx_default_wake_function+0x10/0x10
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232831]  ? __pfx_zio_execute+0x10/0x10 [zfs]
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232944]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232953]  kthread+0xef/0x120
2024-12-11T18:07:33.184545+01:00 13900HostFsn kernel: [4390944.232955]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232957]  ret_from_fork+0x44/0x70
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232959]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232961]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232963]  </TASK>
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232965] INFO: task zvol_tq-2:3503540 blocked for more than 122 seconds.
2024-12-11T18:07:33.184546+01:00 13900HostFsn kernel: [4390944.232966]       Tainted: P    B   W  O       6.8.12-2-pve #1
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232968] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232969] task:zvol_tq-2       state:D stack:0     pid:3503540 tgid:3503540 ppid:2      flags:0x00004000
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232972] Call Trace:
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232973]  <TASK>
2024-12-11T18:07:33.184547+01:00 13900HostFsn kernel: [4390944.232974]  __schedule+0x401/0x15e0
2024-12-11T18:07:33.184552+01:00 13900HostFsn kernel: [4390944.232977]  schedule+0x33/0x110
2024-12-11T18:07:33.184552+01:00 13900HostFsn kernel: [4390944.232980]  schedule_preempt_disabled+0x15/0x30
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.232982]  __mutex_lock.constprop.0+0x3f8/0x7a0
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.232985]  __mutex_lock_slowpath+0x13/0x20
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.232987]  mutex_lock+0x3c/0x50
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.232989]  arc_buf_access+0x6f/0x1c0 [zfs]
2024-12-11T18:07:33.184553+01:00 13900HostFsn kernel: [4390944.233104]  dbuf_hold_impl+0x9a/0x730 [zfs]
2024-12-11T18:07:33.184554+01:00 13900HostFsn kernel: [4390944.233223]  ? zio_create+0x3e8/0x660 [zfs]
2024-12-11T18:07:33.184540+01:00 13900HostFsn kernel: [4390944.232316]  ? __pfx_zio_execute+0x10/0x10 [zfs]
2024-12-11T18:07:33.183585+01:00 13900HostFsn kernel: [4390944.231453]  zio_done+0x289/0x10b0 [zfs]
2024-12-11T18:07:33.183586+01:00 13900HostFsn kernel: [4390944.231566]  zio_execute+0x88/0x130 [zfs]
2024-12-11T18:07:33.183586+01:00 13900HostFsn kernel: [4390944.231678]  taskq_thread+0x27f/0x4c0 [spl]
2024-12-11T18:07:33.183587+01:00 13900HostFsn kernel: [4390944.231688]  ? __pfx_default_wake_function+0x10/0x10
2024-12-11T18:07:33.183587+01:00 13900HostFsn kernel: [4390944.231691]  ? __pfx_zio_execute+0x10/0x10 [zfs]
2024-12-11T18:07:33.183587+01:00 13900HostFsn kernel: [4390944.231804]  ? __pfx_taskq_thread+0x10/0x10 [spl]
2024-12-11T18:07:33.183587+01:00 13900HostFsn kernel: [4390944.231813]  kthread+0xef/0x120
2024-12-11T18:07:33.183588+01:00 13900HostFsn kernel: [4390944.231815]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.183588+01:00 13900HostFsn kernel: [4390944.231817]  ret_from_fork+0x44/0x70
2024-12-11T18:07:33.183588+01:00 13900HostFsn kernel: [4390944.231820]  ? __pfx_kthread+0x10/0x10
2024-12-11T18:07:33.183588+01:00 13900HostFsn kernel: [4390944.231821]  ret_from_fork_asm+0x1b/0x30
2024-12-11T18:07:33.183589+01:00 13900HostFsn kernel: [4390944.231824]  </TASK>
2024-12-11T18:07:33.183589+01:00 13900HostFsn kernel: [4390944.231826] INFO: task z_wr_int_4:3503235 blocked for more than 122 seconds.
2024-12-11T18:07:33.183589+01:00 13900HostFsn kernel: [4390944.231828]       Tainted: P    B   W  O       6.8.12-2-pve #1
2024-12-11T18:07:33.183589+01:00 13900HostFsn kernel: [4390944.231829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-12-11T18:07:33.183590+01:00 13900HostFsn kernel: [4390944.231831] task:z_wr_int_4      state:D stack:0     pid:3503235 tgid:3503235 ppid:2      flags:0x00004000
[...]
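
The excerpt above is not in chronological order; the export from the log server seems to have shuffled it in chunks. A minimal Python sketch that restores the order by sorting on the kernel's own bracketed timestamp, assuming all lines come from the same boot (the counter restarts on reboot); `sort_by_kernel_time` is a hypothetical helper name:

```python
import re

# The kernel's monotonic timestamp, e.g. "[4390944.232454]".
KERNEL_TS = re.compile(r"\[\s*(\d+\.\d+)\]")


def sort_by_kernel_time(lines):
    """Order log lines by kernel timestamp; lines without one are moved to the end."""
    def key(line):
        match = KERNEL_TS.search(line)
        return (0, float(match.group(1))) if match else (1, 0.0)

    # sorted() is stable, so lines without a timestamp keep their relative order.
    return sorted(lines, key=key)
```

Sorting on the kernel timestamp rather than the syslog timestamp at the start of each line is a deliberate choice here: several lines share the same syslog microsecond, while the kernel value is distinct for every line shown.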

edit: attached full logs

Anyhow, since this is my production server, I will no longer rely on Intel 13900 CPUs and will switch to AMD for a while...
