Random freezes, maybe ZFS related

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes

--> #10: Set pcie_aspm=off and pcie_port_pm=off
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 -> pending
#12: Set intel_pstate=disable -> pending
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200
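With this many boot parameters in play, it helps to confirm which ones the running kernel actually received. A minimal sketch, assuming the parameter names from the list above; the function name check_cmdline is made up for illustration:

```shell
#!/bin/sh
# Report which of the boot parameters from the list are present on a
# kernel command line passed as the first argument.
# Typical call on the host: check_cmdline "$(cat /proc/cmdline)"
check_cmdline() {
    cmdline=$1
    for param in intel_iommu=off pcie_aspm=off pcie_port_pm=off \
                 intel_idle.max_cstate=0 processor.max_cstate=1 \
                 intel_pstate=disable nvme_core.default_ps_max_latency_us=0 \
                 mitigations=off; do
        # Pad with spaces so only whole parameters match, not substrings.
        case " $cmdline " in
            *" $param "*) echo "set: $param" ;;
            *)            echo "missing: $param" ;;
        esac
    done
}
```

A parameter reported as missing after a reboot usually means the bootloader configuration was edited but not regenerated.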
 
Another crash

Code:
May 04 18:03:03 srv02 kernel: VERIFY3(remove_reference(hdr, hdr) > 0) failed (0 > 0)
May 04 18:03:03 srv02 kernel: PANIC at arc.c:6610:arc_write_done()
May 04 18:03:03 srv02 kernel: Showing stack for process 785
May 04 18:03:03 srv02 kernel: CPU: 28 PID: 785 Comm: z_wr_int_2 Tainted: P           O       6.5.13-5-pve #1
May 04 18:03:03 srv02 kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/W680/MB DC, BIOS 2008 03/20/2024
May 04 18:03:03 srv02 kernel: Call Trace:
May 04 18:03:03 srv02 kernel:  <TASK>
May 04 18:03:03 srv02 kernel:  dump_stack_lvl+0x48/0x70
May 04 18:03:03 srv02 kernel:  dump_stack+0x10/0x20
May 04 18:03:03 srv02 kernel:  spl_dumpstack+0x29/0x40 [spl]
May 04 18:03:03 srv02 kernel:  spl_panic+0xfc/0x120 [spl]
May 04 18:03:03 srv02 kernel:  arc_write_done+0x44f/0x550 [zfs]
May 04 18:03:03 srv02 kernel:  ? mutex_lock+0x12/0x50
May 04 18:03:03 srv02 kernel:  zio_done+0x289/0x10b0 [zfs]
May 04 18:03:03 srv02 kernel:  ? kfree+0x78/0x120
May 04 18:03:03 srv02 kernel:  zio_execute+0x88/0x130 [zfs]
May 04 18:03:03 srv02 kernel:  taskq_thread+0x27f/0x490 [spl]
May 04 18:03:03 srv02 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 04 18:03:03 srv02 kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
May 04 18:03:03 srv02 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 04 18:03:03 srv02 kernel:  kthread+0xef/0x120
May 04 18:03:03 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:03 srv02 kernel:  ret_from_fork+0x44/0x70
May 04 18:03:03 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:03 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 04 18:03:03 srv02 kernel:  </TASK>
May 04 18:03:17 srv02 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000060
May 04 18:03:17 srv02 kernel: #PF: supervisor read access in kernel mode
May 04 18:03:17 srv02 kernel: #PF: error_code(0x0000) - not-present page
May 04 18:03:17 srv02 kernel: PGD 0 P4D 0
May 04 18:03:17 srv02 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
May 04 18:03:17 srv02 kernel: CPU: 4 PID: 3511 Comm: zvol Tainted: P           O       6.5.13-5-pve #1
May 04 18:03:17 srv02 kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/W680/MB DC, BIOS 2008 03/20/2024
May 04 18:03:17 srv02 kernel: RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel: Code: 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 8b 1f <48> 81 7b 60 40 92 c6 c0 0f 84 f5 00 00 00 48 8b 33 48 8b 53 08 48
May 04 18:03:17 srv02 kernel: RSP: 0018:ffff9f7e8ddabb90 EFLAGS: 00010286
May 04 18:03:17 srv02 kernel: RAX: ffff9455646eb720 RBX: 0000000000000000 RCX: 0000000000000000
May 04 18:03:17 srv02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9454dfd7c960
May 04 18:03:17 srv02 kernel: RBP: ffff9f7e8ddabbb8 R08: 0000000000000000 R09: 0000000000000000
May 04 18:03:17 srv02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 04 18:03:17 srv02 kernel: R13: ffff9455646eb720 R14: 0000000000000000 R15: ffff9f7e8ddabc48
May 04 18:03:17 srv02 kernel: FS:  0000000000000000(0000) GS:ffff9460bf100000(0000) knlGS:0000000000000000
May 04 18:03:17 srv02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060 CR3: 00000004e2856000 CR4: 0000000000752ee0
May 04 18:03:17 srv02 kernel: PKRU: 55555554
May 04 18:03:17 srv02 kernel: Call Trace:
May 04 18:03:17 srv02 kernel:  <TASK>
May 04 18:03:17 srv02 kernel:  ? show_regs+0x6d/0x80
May 04 18:03:17 srv02 kernel:  ? __die+0x24/0x80
May 04 18:03:17 srv02 kernel:  ? page_fault_oops+0x176/0x500
May 04 18:03:17 srv02 kernel:  ? do_user_addr_fault+0x31d/0x6a0
May 04 18:03:17 srv02 kernel:  ? exc_page_fault+0x83/0x1b0
May 04 18:03:17 srv02 kernel:  ? asm_exc_page_fault+0x27/0x30
May 04 18:03:17 srv02 kernel:  ? arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel:  dbuf_hold_impl+0x9a/0x730 [zfs]
May 04 18:03:17 srv02 kernel:  ? zio_create+0x3e8/0x660 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_check_ioerr+0x61/0x110 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_count_write+0xe2/0x1d0 [zfs]
May 04 18:03:17 srv02 kernel:  ? dmu_tx_hold_dnode_impl+0x57/0x130 [zfs]
May 04 18:03:17 srv02 kernel:  dmu_tx_hold_write_by_dnode+0x3a/0x60 [zfs]
May 04 18:03:17 srv02 kernel:  zvol_write+0x226/0x680 [zfs]
May 04 18:03:17 srv02 kernel:  zvol_write_task+0x12/0x30 [zfs]
May 04 18:03:17 srv02 kernel:  taskq_thread+0x27f/0x490 [spl]
May 04 18:03:17 srv02 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 04 18:03:17 srv02 kernel:  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
May 04 18:03:17 srv02 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 04 18:03:17 srv02 kernel:  kthread+0xef/0x120
May 04 18:03:17 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:17 srv02 kernel:  ret_from_fork+0x44/0x70
May 04 18:03:17 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 04 18:03:17 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 04 18:03:17 srv02 kernel:  </TASK>
May 04 18:03:17 srv02 kernel: Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 rdma_cm iw_cm ib_cm ib_core cifs_md4 fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel pmt_telemetry i915 mei_hdcp mei_pxp pmt_class kvm drm_buddy irqbypass ttm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel drm_display_helper eeepc_wmi asus_wmi crypto_simd ledtrig_audio cec cryptd sparse_keymap rapl intel_cstate cmdlinepart platform_profile spi_nor rc_core serio_raw wmi_bmof mei_me pcspkr drm_kms_helper mtd mei i2c_algo_bit intel_vsec acpi_pad acpi_tad joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs
May 04 18:03:17 srv02 kernel:  blake2b_generic xor raid6_pq libcrc32c simplefb hid_generic usbmouse usbkbd usbhid hid xhci_pci nvme xhci_pci_renesas crc32_pclmul igc spi_intel_pci i2c_i801 nvme_core xhci_hcd spi_intel i2c_smbus intel_lpss_pci ahci nvme_common intel_lpss libahci idma64 video wmi pinctrl_alderlake
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060
May 04 18:03:17 srv02 kernel: ---[ end trace 0000000000000000 ]---
May 04 18:03:17 srv02 kernel: RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]
May 04 18:03:17 srv02 kernel: Code: 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 8b 1f <48> 81 7b 60 40 92 c6 c0 0f 84 f5 00 00 00 48 8b 33 48 8b 53 08 48
May 04 18:03:17 srv02 kernel: RSP: 0018:ffff9f7e8ddabb90 EFLAGS: 00010286
May 04 18:03:17 srv02 kernel: RAX: ffff9455646eb720 RBX: 0000000000000000 RCX: 0000000000000000
May 04 18:03:17 srv02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9454dfd7c960
May 04 18:03:17 srv02 kernel: RBP: ffff9f7e8ddabbb8 R08: 0000000000000000 R09: 0000000000000000
May 04 18:03:17 srv02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 04 18:03:17 srv02 kernel: R13: ffff9455646eb720 R14: 0000000000000000 R15: ffff9f7e8ddabc48
May 04 18:03:17 srv02 kernel: FS:  0000000000000000(0000) GS:ffff9460bf100000(0000) knlGS:0000000000000000
May 04 18:03:17 srv02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 04 18:03:17 srv02 kernel: CR2: 0000000000000060 CR3: 00000004e2856000 CR4: 0000000000752ee0
May 04 18:03:17 srv02 kernel: PKRU: 55555554
May 04 18:03:17 srv02 kernel: note: zvol[3511] exited with irqs disabled
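To compare crashes across reboots, it can be easier to pull out just the lines that mark the start of each failure. A rough sketch, assuming the markers seen in the log above (VERIFY3, PANIC at, BUG:, Oops:, RIP:) are the interesting ones; the function name crash_markers is made up:

```shell
#!/bin/sh
# Print only the lines that mark a ZFS assertion failure or a kernel oops.
# Reads kernel log text on stdin, for example the previous boot's log:
#   journalctl -k -b -1 --no-pager | crash_markers
crash_markers() {
    grep -E 'VERIFY3?\(|PANIC at |BUG: |Oops: |RIP: '
}
```

If the previous boot's log yields nothing here, the machine probably froze before anything reached the journal, which matches the crashes without logs.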
 
Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes
#10: Set pcie_aspm=off and pcie_port_pm=off -> still crashes

--> #11: Set intel_idle.max_cstate=0 and processor.max_cstate=1

#12: Set intel_pstate=disable -> pending
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200
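For #6, the loaded microcode revision can be confirmed from /proc/cpuinfo, which prints it without leading zeros (0x122 rather than 0x00000122). A small sketch; the function name microcode_revisions is made up:

```shell
#!/bin/sh
# Print the distinct microcode revisions found in /proc/cpuinfo-style text.
# Typical call on the host: microcode_revisions < /proc/cpuinfo
microcode_revisions() {
    # Lines look like "microcode<TAB>: 0x122"; keep the value, drop repeats.
    awk -F': *' '/^microcode/ { print $2 }' | sort -u
}
```

More than one line of output would mean the cores are not all running the same revision.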
 
New crash but without any logs....

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> still crashes
#10: Set pcie_aspm=off and pcie_port_pm=off -> still crashes
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1

--> #12: Set intel_pstate=disable
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 -> pending
#14: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#15: maybe try mitigations=off -> pending
#16: maybe lower the RAM from DDR5-4400 to DDR5-4200 -> pending
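Several of these settings (#7, #8, #13) can also be read back at runtime from sysfs, which confirms they took effect. A minimal sketch; the function name show_runtime_settings and the optional root argument are made up, the latter only so the function can be tried against a scratch directory:

```shell
#!/bin/sh
# Print "path=value" for each runtime setting, or "path=unavailable" if the
# file does not exist (for example when the module is not loaded).
# Typical call on the host: show_runtime_settings
show_runtime_settings() {
    root=${1:-/sys}
    for rel in kernel/mm/ksm/run \
               module/kvm/parameters/ignore_msrs \
               module/kvm/parameters/report_ignored_msrs \
               module/nvme_core/parameters/default_ps_max_latency_us; do
        if [ -r "$root/$rel" ]; then
            echo "$rel=$(cat "$root/$rel")"
        else
            echo "$rel=unavailable"
        fi
    done
}
```

For KSM, a value of 0 in kernel/mm/ksm/run means it is stopped; the kvm boolean parameters read back as Y or N.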
 
Another crash after 10min uptime, again without any logs....
30min after that, another crash, so now at #14

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)

--> #14: Disable GPU Power Management via i915.enable_dc=0
#15: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline via elevator=mq-deadline -> pending
#16: maybe try mitigations=off -> pending
#17: maybe lower the RAM from DDR5-4400 to DDR5-4200 -> pending
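For #15, the scheduler file in sysfs lists all available schedulers and marks the active one in square brackets. As far as I know, recent kernels no longer act on the elevator= boot parameter, so reading the sysfs file back is the safer way to confirm the change. A small sketch; the function name active_scheduler is made up:

```shell
#!/bin/sh
# Extract the active scheduler (the bracketed entry) from the contents of a
# queue/scheduler file, e.g. "[none] mq-deadline kyber" gives "none".
# Typical call on the host:
#   for f in /sys/block/nvme*n1/queue/scheduler; do
#       echo "$f: $(active_scheduler "$(cat "$f")")"
#   done
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

Switching at runtime is done by writing the scheduler name into the same file as root; that change does not survive a reboot.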
 
Try using an older version of ZFS. Maybe it will help.

I stuck with kernel 6.2.16-18-pve and ZFS 2.1.13-pve1, and I'm not sure when I'll upgrade to a newer kernel.
 
It's getting worse and worse. I reverted a couple of settings and disabled ASPM in the BIOS completely


Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)

#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
#14: Disable GPU Power Management via i915.enable_dc=0 (still crashes)
#15: /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline (still crashes)

#16: Lower the RAM from DDR5-4400 to DDR5-4200 (still crashes)
-> #17: Revert some of the changes, disable ASPM in the BIOS
#18: maybe try mitigations=off -> pending
 
At first, I was convinced that the hardware had to be okay, because it had already been replaced in March 2024.

In March, I also experienced random crashes, but the server was still running ESXi, which isn't officially supported on this hardware.
So, I decided to migrate to Proxmox. I had crashes while installing Proxmox, so the hardware was changed, and I had no more issues with this server until last week.

Finally, after another complete hardware change yesterday, I have had no crashes for more than 16 hours.

So maybe there is an issue with the Asus Pro WS W680 boards, with the chipset, and/or with the i9-13900 (non-K) as the hardware ages.

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Disable KSM -> 2024-05-03 disabled (still crashes)
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
#14: Disable GPU Power Management via i915.enable_dc=0 (still crashes)
#15: /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline (still crashes)

#16: Lower the RAM from DDR5-4400 to DDR5-4200 (still crashes)
#17: Revert some of the changes, disable ASPM in the BIOS (still crashes)
-> #18: Let Hetzner change the complete hardware, revert most of the changes <- working

BIOS is 2008, standard intel-microcode package from Debian stable, kernel is now 6.8.4-2-pve with the initially used cmdline:
Code:
pcie_aspm.policy=performance split_lock_detect=off
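For anyone wanting to reproduce this cmdline: on Proxmox hosts managed by proxmox-boot-tool the parameters live in /etc/kernel/cmdline and are applied with proxmox-boot-tool refresh, while GRUB setups use GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub followed by update-grub. Which of the two applies to this server is not stated, so the sketch below only builds the string; the function name add_param is made up:

```shell
#!/bin/sh
# Append a parameter to a command-line string unless it is already present,
# so repeated runs do not pile up duplicates.
add_param() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "${1:+$1 }$2" ;;
    esac
}

# Example: build the cmdline from the post on top of an existing one.
#   line=$(add_param "$(cat /etc/kernel/cmdline)" pcie_aspm.policy=performance)
#   line=$(add_param "$line" split_lock_detect=off)
```

The result still has to be written back to the right file and the bootloader refreshed before a reboot picks it up.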
 
@ksb, glad to see it appears to be working.

Did you find a benefit to using cache=none, aio=threads and iothread=1 despite the change in hardware?
 
No, I didn't find a benefit. I think because it is NVMe only, it doesn't make a huge difference in my environment.
 
