Random freezes, maybe ZFS related

If you meant aio=threads, then yes. You also need iothread=1.

As for cache=none (default), I rely on what the physical disk controller provides. In my analysis, the performance improvement of also enabling various QEMU cache types was negligible. It can also make troubleshooting difficult by shifting the problem around and can cause host memory pressure and undesirable paging IO. Stefan speaks to this in https://bugzilla.kernel.org/show_bug.cgi?id=199727#c12 and https://bugzilla.kernel.org/show_bug.cgi?id=199727#c16.

Are you using any other SCSI Controller types on any other VMs, or are they all VirtIO SCSI single?

It might be helpful to post the contents of an example VM configuration from its *.conf file in /etc/pve/qemu-server/.
It might also be informative to grab the PID of one of the affected kvm processes from top and post the output of ps aux | grep <pid>.
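
A minimal sketch of collecting that for a single VM, assuming the pidfile location that Proxmox passes to kvm (/var/run/qemu-server/<vmid>.pid). The helper name vm_pid and its optional directory argument are made up for illustration:

```shell
# vm_pid VMID [PIDDIR]: print the PID recorded in the VM's pidfile.
# PIDDIR defaults to /var/run/qemu-server; the second argument is only a
# hypothetical convenience for pointing the helper at another directory.
vm_pid() {
    cat "${2:-/var/run/qemu-server}/$1.pid"
}

# Example (as root on the host): full, untruncated command line of VM 103
#   ps -ww -o pid,args -p "$(vm_pid 103)"
```

Using the pidfile avoids matching the grep process itself, which ps aux | grep <pid> tends to do.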
I already had iothread=1.
All VMs have VirtIO SCSI single.

Here is an example config (the cache=none setting is from today):
Code:
agent: 1
bios: seabios
boot: order=ide0;scsi0;scsi1;scsi2;scsi3;scsi4
cores: 2
cpu: host
machine: q35
memory: 512
meta: creation-qemu=8.1.5,ctime=1711784910
name: srv11v
net0: virtio=00:50:56:91:72:cf,bridge=vmbr2,queues=8
ostype: l26
scsi0: local-zfs:vm-103-disk-0,aio=threads,cache=none,discard=on,iothread=1,size=600M,ssd=1
scsi1: local-zfs:vm-103-disk-1,aio=threads,cache=none,discard=on,iothread=1,size=4G,ssd=1
scsi2: local-zfs:vm-103-disk-2,aio=threads,cache=none,discard=on,iothread=1,size=20G,ssd=1
scsi3: local-zfs:vm-103-disk-3,aio=threads,cache=none,discard=on,iothread=1,size=6G,ssd=1
scsi4: local-zfs:vm-103-disk-4,aio=threads,cache=none,discard=on,iothread=1,size=3500M,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=564de6d5-959c-003e-e7f0-8b46ea663c35
sockets: 1
vmgenid: 91dae9ae-c84c-4c6d-8871-9e9a6af5a748

I checked the PIDs. It was completely mixed, some Linux VMs, some Windows VMs, larger VMs, small VMs.

Here is the cmdline from this VM:
Code:
/usr/bin/kvm -id 103 -name srv11v,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/103.qmp,server=on,wait=off -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5 -mon chardev=qmp-event,mode=control -pidfile /var/run/qemu-server/103.pid -daemonize -smbios type=1,uuid=564de6d5-959c-003e-e7f0-8b46ea663c35 -smp 2,sockets=1,cores=2,maxcpus=2 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vnc unix:/var/run/qemu-server/103.vnc,password=on -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 512 -object iothread,id=iothread-virtioscsi0 -object iothread,id=iothread-virtioscsi1 -object iothread,id=iothread-virtioscsi2 -object iothread,id=iothread-virtioscsi3 -object iothread,id=iothread-virtioscsi4 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device vmgenid,guid=91dae9ae-c84c-4c6d-8871-9e9a6af5a748 -device usb-tablet,id=tablet,bus=ehci.0,port=1 -device VGA,id=vga,bus=pcie.0,addr=0x1 -chardev socket,path=/var/run/qemu-server/103.qga,server=on,wait=off,id=qga0 -device virtio-serial,id=qga0,bus=pci.0,addr=0x8 -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on -iscsi initiator-name=iqn.1993-08.org.debian:01:297d681f98de -device virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0 -drive file=/dev/zvol/rpool/data/vm-103-disk-0,if=none,id=drive-scsi0,cache=none,aio=threads,discard=on,format=raw,detect-zeroes=unmap -device scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=101 -device virtio-scsi-pci,id=virtioscsi1,bus=pci.3,addr=0x2,iothread=iothread-virtioscsi1 -drive file=/dev/zvol/rpool/data/vm-103-disk-1,if=none,id=drive-scsi1,cache=none,aio=threads,discard=on,format=raw,detect-zeroes=unmap -device 
scsi-hd,bus=virtioscsi1.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1,rotation_rate=1,bootindex=102 -device virtio-scsi-pci,id=virtioscsi2,bus=pci.3,addr=0x3,iothread=iothread-virtioscsi2 -drive file=/dev/zvol/rpool/data/vm-103-disk-2,if=none,id=drive-scsi2,cache=none,aio=threads,discard=on,format=raw,detect-zeroes=unmap -device scsi-hd,bus=virtioscsi2.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi2,id=scsi2,rotation_rate=1,bootindex=103 -device virtio-scsi-pci,id=virtioscsi3,bus=pci.3,addr=0x4,iothread=iothread-virtioscsi3 -drive file=/dev/zvol/rpool/data/vm-103-disk-3,if=none,id=drive-scsi3,cache=none,aio=threads,discard=on,format=raw,detect-zeroes=unmap -device scsi-hd,bus=virtioscsi3.0,channel=0,scsi-id=0,lun=3,drive=drive-scsi3,id=scsi3,rotation_rate=1,bootindex=104 -device virtio-scsi-pci,id=virtioscsi4,bus=pci.3,addr=0x5,iothread=iothread-virtioscsi4 -drive file=/dev/zvol/rpool/data/vm-103-disk-4,if=none,id=drive-scsi4,cache=none,aio=threads,discard=on,format=raw,detect-zeroes=unmap -device scsi-hd,bus=virtioscsi4.0,channel=0,scsi-id=0,lun=4,drive=drive-scsi4,id=scsi4,rotation_rate=1,bootindex=105 -netdev type=tap,id=net0,ifname=tap103i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on,queues=8 -device virtio-net-pci,mac=00:50:56:91:72:cf,netdev=net0,bus=pci.0,addr=0x12,id=net0,vectors=18,mq=on,packed=on,rx_queue_size=1024,tx_queue_size=256 -machine type=q35+pve0
 
It only takes one VM to bring it all down. That's part of the "fun" in troubleshooting this.
Have you had any improvement in performance?

You may need to consider a microcode update. Can you post the output of:
journalctl --no-hostname -o short-monotonic --boot -0 | sed -n '1,/PM: Preparing system for sleep/p' | grep 'microcode\|smp'
 
No, no improvements regarding performance. With NVMe SSDs everything is fast already.
Currently, without ARC, it is of course slower than before I started my investigations.

Microcode updates are installed:
Code:
[    0.109704] kernel: smpboot: Allowing 32 CPUs, 0 hotplug CPUs
[    0.294373] kernel: Register File Data Sampling: Vulnerable: No microcode
[    0.325197] kernel: smpboot: CPU0: 13th Gen Intel(R) Core(TM) i9-13900 (family: 0x6, model: 0xb7, stepping: 0x1)
[    0.327878] kernel: smp: Bringing up secondary CPUs ...
[    0.327983] kernel: smpboot: x86: Booting SMP configuration:
[    0.352166] kernel: smp: Brought up 1 node, 32 CPUs
[    0.352166] kernel: smpboot: Max logical packages: 1
[    0.352166] kernel: smpboot: Total of 32 processors activated (127795.20 BogoMIPS)
[    0.999413] kernel: microcode: Current revision: 0x0000011f
 
So rev 0x11f is a little stale.
0x122 is in the 20240312 staging release at https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/releases.
Just search for your F/M/S, i.e. search in the page for "06-b7-01" (no quotes), and you will find it.
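
The F/M/S string is just family, model and stepping written as two-digit hex values. A small sketch that builds it from the numbers in the smpboot line; the function name fms_name is made up for illustration:

```shell
# fms_name FAMILY MODEL STEPPING: format the three values as the
# "ff-mm-ss" string to search for on the releases page.
# Arguments may carry a 0x prefix, exactly as the kernel prints them.
fms_name() {
    printf '%02x-%02x-%02x\n' "$1" "$2" "$3"
}

# i9-13900 from the log: family 0x6, model 0xb7, stepping 0x1
fms_name 0x6 0xb7 0x1   # prints 06-b7-01
```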

So prep the Debian microcode apt source with:
echo deb http://ftp.au.debian.org/debian sid non-free-firmware > /etc/apt/sources.list.d/firmware.list

Then do an update to refresh sources, e.g. apt-get update && apt-get dist-upgrade and reboot (if you can).
Install the microcode: apt-get install intel-microcode
Reboot when you can...

Proxmox blocks the microcode module, but it should work using this method instead (mine does).
You will get a warning in the Updates page for your node, but perhaps it will fix your issue.
If it does fix the problem, perhaps you can escalate it with Hetzner...

You may want to deal with MSRs too (if you haven't already), as follows:

Until next boot:
echo 1 > /sys/module/kvm/parameters/ignore_msrs
echo 0 > /sys/module/kvm/parameters/report_ignored_msrs

Or to persist over reboots:
echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf
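
To confirm what KVM is actually using, the values can be read back from the same sysfs directory. A hedged sketch; the helper name show_kvm_msr_params and its optional directory argument are assumptions for illustration only:

```shell
# show_kvm_msr_params [DIR]: print the two MSR-related kvm parameters.
# DIR defaults to /sys/module/kvm/parameters (only present while the kvm
# module is loaded); the argument is a hypothetical convenience.
show_kvm_msr_params() {
    dir="${1:-/sys/module/kvm/parameters}"
    for p in ignore_msrs report_ignored_msrs; do
        printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
    done
}
```

Boolean module parameters read back as Y or N, so after the two echo commands above the expected result is ignore_msrs=Y and report_ignored_msrs=N.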
 
My bad - missed a step:

apt-get install intel-microcode

Before this, apt-get update is sufficient.

I will fix the above post.
 
Thanks a lot, @benyamin.
I will wait for the outcome of "cache=none". If that does not help, I will disable the IOMMU (intel_iommu=off). After that I will look into the microcode and MSR settings.
Looks like a plan :)

edit: Hetzner support replied. There is a BIOS update 2008 (currently 2007) for my motherboard.
 
Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)

--> #5: Updated BIOS to 2008, 2024-05-03 <--
#6: Install intel-microcode from debian sid -> pending
#7: Deal with MSRs -> pending
 
Code:
May 03 09:34:39 srv02 kernel: VERIFY3(remove_reference(hdr, hdr) > 0) failed (0 > 0)
May 03 09:34:39 srv02 kernel: PANIC at arc.c:6610:arc_write_done()
May 03 09:34:39 srv02 kernel: Showing stack for process 802
May 03 09:34:39 srv02 kernel: CPU: 20 PID: 802 Comm: z_wr_int_0 Tainted: P           O       6.8.4-2-pve #1
May 03 09:34:39 srv02 kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/W680/MB DC, BIOS 2008 03/20/2024
May 03 09:34:39 srv02 kernel: Call Trace:
May 03 09:34:39 srv02 kernel:  <TASK>
May 03 09:34:39 srv02 kernel:  dump_stack_lvl+0x48/0x70
May 03 09:34:39 srv02 kernel:  dump_stack+0x10/0x20
May 03 09:34:39 srv02 kernel:  spl_dumpstack+0x29/0x40 [spl]
May 03 09:34:39 srv02 kernel:  spl_panic+0xfc/0x120 [spl]
May 03 09:34:39 srv02 kernel:  arc_write_done+0x44f/0x550 [zfs]
May 03 09:34:39 srv02 kernel:  zio_done+0x289/0x10b0 [zfs]
May 03 09:34:39 srv02 kernel:  zio_execute+0x88/0x130 [zfs]
May 03 09:34:39 srv02 kernel:  taskq_thread+0x27f/0x490 [spl]
May 03 09:34:39 srv02 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 03 09:34:39 srv02 kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
May 03 09:34:39 srv02 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 03 09:34:39 srv02 kernel:  kthread+0xef/0x120
May 03 09:34:39 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 09:34:39 srv02 kernel:  ret_from_fork+0x44/0x70
May 03 09:34:39 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 09:34:39 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 03 09:34:39 srv02 kernel:  </TASK>
May 03 09:35:01 srv02 CRON[66525]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 03 09:35:01 srv02 CRON[66526]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 03 09:35:01 srv02 CRON[66525]: pam_unix(cron:session): session closed for user root
May 03 09:35:08 srv02 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000060
May 03 09:35:08 srv02 kernel: #PF: supervisor read access in kernel mode
May 03 09:35:08 srv02 kernel: #PF: error_code(0x0000) - not-present page
May 03 09:35:08 srv02 kernel: PGD 0 P4D 0
May 03 09:35:08 srv02 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
May 03 09:35:08 srv02 kernel: CPU: 12 PID: 3344 Comm: zvol Tainted: P           O       6.8.4-2-pve #1
May 03 09:35:08 srv02 kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/W680/MB DC, BIOS 2008 03/20/2024
May 03 09:35:08 srv02 kernel: RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]
May 03 09:35:08 srv02 kernel: Code: 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 8b 1f <48> 81 7b 60 40 71 ca c0 0f 84 f5 00 00 00 48 8b 33 48 8b 53 08 48
May 03 09:35:08 srv02 kernel: RSP: 0018:ffffb4df936c38d8 EFLAGS: 00010282
May 03 09:35:08 srv02 kernel: RAX: ffff89dca1c921b0 RBX: 0000000000000000 RCX: 0000000000000000
May 03 09:35:08 srv02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff89dc4d553180
May 03 09:35:08 srv02 kernel: RBP: ffffb4df936c3900 R08: 0000000000000000 R09: 0000000000000000
May 03 09:35:08 srv02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 03 09:35:08 srv02 kernel: R13: ffff89dca1c921b0 R14: 0000000000000000 R15: ffffb4df936c3990
May 03 09:35:08 srv02 kernel: FS:  0000000000000000(0000) GS:ffff89eb7ee00000(0000) knlGS:0000000000000000
May 03 09:35:08 srv02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 03 09:35:08 srv02 kernel: CR2: 0000000000000060 CR3: 000000018c796000 CR4: 0000000000f52ef0
May 03 09:35:08 srv02 kernel: PKRU: 55555554
May 03 09:35:08 srv02 kernel: Call Trace:
May 03 09:35:08 srv02 kernel:  <TASK>
May 03 09:35:08 srv02 kernel:  ? show_regs+0x6d/0x80
May 03 09:35:08 srv02 kernel:  ? __die+0x24/0x80
May 03 09:35:08 srv02 kernel:  ? page_fault_oops+0x176/0x500
May 03 09:35:08 srv02 kernel:  ? arc_read+0xd28/0x17c0 [zfs]
May 03 09:35:08 srv02 kernel:  ? do_user_addr_fault+0x2f9/0x6b0
May 03 09:35:08 srv02 kernel:  ? exc_page_fault+0x83/0x1b0
May 03 09:35:08 srv02 kernel:  ? asm_exc_page_fault+0x27/0x30
May 03 09:35:08 srv02 kernel:  ? arc_buf_access+0x15/0x1c0 [zfs]
May 03 09:35:08 srv02 kernel:  dbuf_hold_impl+0x9a/0x730 [zfs]
May 03 09:35:08 srv02 kernel:  dbuf_hold+0x33/0x70 [zfs]
May 03 09:35:08 srv02 kernel:  dmu_buf_hold_array_by_dnode+0x155/0x6a0 [zfs]
May 03 09:35:08 srv02 kernel:  dmu_read_impl+0x12c/0x1e0 [zfs]
May 03 09:35:08 srv02 kernel:  ? spl_kvmalloc+0x84/0xc0 [spl]
May 03 09:35:08 srv02 kernel:  dmu_read_by_dnode+0xe/0x20 [zfs]
May 03 09:35:08 srv02 kernel:  zvol_get_data+0xac/0x1a0 [zfs]
May 03 09:35:08 srv02 kernel:  zil_lwb_write_issue+0xcb4/0xd90 [zfs]
May 03 09:35:08 srv02 kernel:  zil_commit_impl+0x21d/0x1260 [zfs]
May 03 09:35:08 srv02 kernel:  zil_commit+0x3d/0x80 [zfs]
May 03 09:35:08 srv02 kernel:  zvol_write+0x3b1/0x670 [zfs]
May 03 09:35:08 srv02 kernel:  zvol_write_task+0x12/0x30 [zfs]
May 03 09:35:08 srv02 kernel:  taskq_thread+0x27f/0x490 [spl]
May 03 09:35:08 srv02 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 03 09:35:08 srv02 kernel:  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
May 03 09:35:08 srv02 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 03 09:35:08 srv02 kernel:  kthread+0xef/0x120
May 03 09:35:08 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 09:35:08 srv02 kernel:  ret_from_fork+0x44/0x70
May 03 09:35:08 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 09:35:08 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 03 09:35:08 srv02 kernel:  </TASK>
May 03 09:35:08 srv02 kernel: Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm ib_core cifs_md4 netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls softdog su>
May 03 09:35:08 srv02 kernel:  btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid hid mfd_aaeon xhci_pci asus_wmi nvme ledtrig_audio xhci_pci_renesas sparse_keymap platform_profile crc32_pclmul ahci igc nvme_core xhci_hcd intel_lpss_pci i2c_i801 spi_intel_>
May 03 09:35:08 srv02 kernel: CR2: 0000000000000060
May 03 09:35:08 srv02 kernel: ---[ end trace 0000000000000000 ]---
May 03 09:35:08 srv02 kernel: RIP: 0010:arc_buf_access+0x15/0x1c0 [zfs]
May 03 09:35:08 srv02 kernel: Code: 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 8b 1f <48> 81 7b 60 40 71 ca c0 0f 84 f5 00 00 00 48 8b 33 48 8b 53 08 48
May 03 09:35:08 srv02 kernel: RSP: 0018:ffffb4df936c38d8 EFLAGS: 00010282
May 03 09:35:08 srv02 kernel: RAX: ffff89dca1c921b0 RBX: 0000000000000000 RCX: 0000000000000000
May 03 09:35:08 srv02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff89dc4d553180
May 03 09:35:08 srv02 kernel: RBP: ffffb4df936c3900 R08: 0000000000000000 R09: 0000000000000000
May 03 09:35:08 srv02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 03 09:35:08 srv02 kernel: R13: ffff89dca1c921b0 R14: 0000000000000000 R15: ffffb4df936c3990
May 03 09:35:08 srv02 kernel: FS:  0000000000000000(0000) GS:ffff89eb7ee00000(0000) knlGS:0000000000000000
May 03 09:35:08 srv02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 03 09:35:08 srv02 kernel: CR2: 0000000000000060 CR3: 000000018c796000 CR4: 0000000000f52ef0
May 03 09:35:08 srv02 kernel: PKRU: 55555554
May 03 09:35:08 srv02 kernel: note: zvol[3344] exited with irqs disabled

The first VM was not reachable. I quickly connected to the host via SSH and killed the VM.
Afterwards, the other VMs started to hang or crash one after another, too.
So the crashes now happen more often than before :(
 
Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)

--> #6: Install intel-microcode from debian sid (2024-05-03) <--- (Current revision: 0x00000122 <- Updated early from: 0x0000011f)
#7: Deal with MSRs -> pending
#8: Disable KSM -> pending
 
aio=threads, cache=none, intel_iommu=off on host, newest intel_microcode from debian sid

Code:
#### 105 ####
strace -c -p $(cat /var/run/qemu-server/105.pid)
strace: Process 3872 attached
^Cstrace: Process 3872 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.019974       19974         1           ppoll
  0.00    0.000000           0         1           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.019974        9987         2           total
####

#### 108 #### (nothing comes back after 10sec?)
strace -c -p $(cat /var/run/qemu-server/108.pid)
strace: Process 3731 attached
^Cstrace: Process 3731 detached

#### 109 #####
strace: Process 3961 attached
^Cstrace: Process 3961 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99    0.066967        3348        20           ppoll
  0.00    0.000003           0        18           ioctl
  0.00    0.000001           0         6           read
  0.00    0.000001           0         6           futex
------ ----------- ----------- --------- --------- ----------------
100.00    0.066972        1339        50           total
####
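
Note that strace -p without -f attaches only to the given thread (here the main QEMU thread), so vCPU and iothread activity is not counted. A sketch that bounds the capture time and includes all threads, plus a small parser for the summary table. The function names are made up for illustration, and the 10-second default is an arbitrary choice:

```shell
# strace_vm VMID [SECONDS]: syscall summary over all threads of a VM.
# timeout sends SIGINT after SECONDS, which makes strace detach and print
# its summary (on stderr). Needs root and strace installed on the host.
strace_vm() {
    timeout -s INT "${2:-10}" strace -c -f -p "$(cat "/var/run/qemu-server/$1.pid")"
}

# top_syscall: read a "strace -c" summary on stdin and print the syscall
# named in the first data row (strace sorts rows by time by default).
top_syscall() {
    awk '$1 ~ /^[0-9.]+$/ && $NF != "total" { print $NF; exit }'
}

# Example: strace_vm 109 2>&1 | top_syscall
```

If the parser prints nothing, the summary contained no data rows, as in the output for VM 108 above.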

Another crash
 
Code:
May 03 14:49:23 srv02 kernel: could not locate request for tag 0xfff
May 03 14:49:23 srv02 kernel: nvme nvme1: invalid id 65535 completed on queue 16
May 03 14:52:47 srv02 kernel: INFO: task txg_sync:756 blocked for more than 122 seconds.
May 03 14:52:47 srv02 kernel:       Tainted: P           O       6.8.4-2-pve #1
May 03 14:52:47 srv02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 03 14:52:47 srv02 kernel: task:txg_sync        state:D stack:0     pid:756   tgid:756   ppid:2      flags:0x00004000
May 03 14:52:47 srv02 kernel: Call Trace:
May 03 14:52:47 srv02 kernel:  <TASK>
May 03 14:52:47 srv02 kernel:  __schedule+0x401/0x15e0
May 03 14:52:47 srv02 kernel:  schedule+0x33/0x110
May 03 14:52:47 srv02 kernel:  schedule_timeout+0x95/0x170
May 03 14:52:47 srv02 kernel:  ? __pfx_process_timeout+0x10/0x10
May 03 14:52:47 srv02 kernel:  io_schedule_timeout+0x51/0x80
May 03 14:52:47 srv02 kernel:  __cv_timedwait_common+0x140/0x180 [spl]
May 03 14:52:47 srv02 kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
May 03 14:52:47 srv02 kernel:  __cv_timedwait_io+0x19/0x30 [spl]
May 03 14:52:47 srv02 kernel:  zio_wait+0x13a/0x2c0 [zfs]
May 03 14:52:47 srv02 kernel:  dsl_pool_sync+0xce/0x4e0 [zfs]
May 03 14:52:47 srv02 kernel:  spa_sync+0x578/0x1030 [zfs]
May 03 14:52:47 srv02 kernel:  ? spa_txg_history_init_io+0x120/0x130 [zfs]
May 03 14:52:47 srv02 kernel:  txg_sync_thread+0x1fd/0x390 [zfs]
May 03 14:52:47 srv02 kernel:  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
May 03 14:52:47 srv02 kernel:  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
May 03 14:52:47 srv02 kernel:  thread_generic_wrapper+0x5c/0x70 [spl]
May 03 14:52:47 srv02 kernel:  kthread+0xef/0x120
May 03 14:52:47 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 14:52:47 srv02 kernel:  ret_from_fork+0x44/0x70
May 03 14:52:47 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 14:52:47 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 03 14:52:47 srv02 kernel:  </TASK>
May 03 14:54:50 srv02 kernel: INFO: task txg_sync:756 blocked for more than 245 seconds.
May 03 14:54:50 srv02 kernel:       Tainted: P           O       6.8.4-2-pve #1
May 03 14:54:50 srv02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 03 14:54:50 srv02 kernel: task:txg_sync        state:D stack:0     pid:756   tgid:756   ppid:2      flags:0x00004000
May 03 14:54:50 srv02 kernel: Call Trace:
May 03 14:54:50 srv02 kernel:  <TASK>
May 03 14:54:50 srv02 kernel:  __schedule+0x401/0x15e0
May 03 14:54:50 srv02 kernel:  schedule+0x33/0x110
May 03 14:54:50 srv02 kernel:  schedule_timeout+0x95/0x170
May 03 14:54:50 srv02 kernel:  ? __pfx_process_timeout+0x10/0x10
May 03 14:54:50 srv02 kernel:  io_schedule_timeout+0x51/0x80
May 03 14:54:50 srv02 kernel:  __cv_timedwait_common+0x140/0x180 [spl]
May 03 14:54:50 srv02 kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
May 03 14:54:50 srv02 kernel:  __cv_timedwait_io+0x19/0x30 [spl]
May 03 14:54:50 srv02 kernel:  zio_wait+0x13a/0x2c0 [zfs]
May 03 14:54:50 srv02 kernel:  dsl_pool_sync+0xce/0x4e0 [zfs]
May 03 14:54:50 srv02 kernel:  spa_sync+0x578/0x1030 [zfs]
May 03 14:54:50 srv02 kernel:  ? spa_txg_history_init_io+0x120/0x130 [zfs]
May 03 14:54:50 srv02 kernel:  txg_sync_thread+0x1fd/0x390 [zfs]
May 03 14:54:50 srv02 kernel:  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
May 03 14:54:50 srv02 kernel:  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
May 03 14:54:50 srv02 kernel:  thread_generic_wrapper+0x5c/0x70 [spl]
May 03 14:54:50 srv02 kernel:  kthread+0xef/0x120
May 03 14:54:50 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 14:54:50 srv02 kernel:  ret_from_fork+0x44/0x70
May 03 14:54:50 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 14:54:50 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 03 14:54:50 srv02 kernel:  </TASK>
May 03 14:54:54 srv02 zed[97816]: eid=7 class=deadman pool='rpool' vdev=nvme-eui.36344730571023490025385300000001-part3 size=102400 offset=1588892266496 priority=3 err=0 flags=0x40080480 delay=17368350ms
May 03 14:54:54 srv02 zed[97812]: eid=6 class=deadman pool='rpool' vdev=nvme-eui.36344730571023490025385300000001-part3 size=4096 offset=1588892311552 priority=3 err=0 flags=0x380080 bookmark=3331:1:0:199839
May 03 14:54:54 srv02 zed[97817]: eid=8 class=deadman pool='rpool' vdev=nvme-eui.36344730571023490025385300000001-part3 size=4096 offset=1588892364800 priority=3 err=0 flags=0x380080 bookmark=3331:1:0:199844
May 03 14:54:54 srv02 zed[97820]: eid=9 class=deadman pool='rpool' vdev=nvme-eui.36344730571023490025385300000001-part3 size=102400 offset=1588892266496 priority=3 err=0 flags=0x40080480 delay=17368350ms
May 03 14:54:54 srv02 zed[97824]: eid=10 class=deadman pool='rpool' vdev=nvme-eui.36344730571023490025385300000001-part3 size=4096 offset=1588892315648 priority=3 err=0 flags=0x380080 bookmark=3331:1:0:199840
May 03 14:54:54 srv02 zed[97826]: eid=11 class=deadman pool='rpool' vdev=nvme-eui.36344730571023490025385300000001-part3 size=102400 offset=1588892266496 priority=3 err=0 flags=0x40080480 delay=17368350ms
May 03 14:54:54 srv02 zed[97828]: eid=12 class=deadman pool='rpool' vdev=nvme-eui.36344730571023490025385300000001-part3 size=45056 offset=1588892266496 priority=3 err=0 flags=0x380080 bookmark=3477:1:1:2634
[...]
 
Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

--> #7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs -> pending
 
@fiona, you did some investigation in the huge "100% cpu freeze thread", which seems to overlap with my issues (see the strace output two posts above).
Is that the case? Or are my problems completely different?
 
This reminds me of an old incident in which a 6700K, still within its warranty period at the time, was half-broken and caused random BSODs on Windows and kernel panics on Linux under virtualization load.
When we did not use virtualization and stressed the CPU with other workloads, say Prime95 or AIDA64 system stress testing, everything worked fine for days.

Our guess was someone managed to damage it with ungraceful opening of microwave ovens. Intel even tried to decline our warranty claim at the beginning lol.
 
It is now crashing again and again... before I started my troubleshooting, it was crashing 1-2x per day :(

Code:
May 03 16:05:15 srv02 kernel: INFO: task txg_sync:772 blocked for more than 122 seconds.
May 03 16:05:15 srv02 kernel:       Tainted: P           O       6.8.4-2-pve #1
May 03 16:05:15 srv02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 03 16:05:15 srv02 kernel: task:txg_sync        state:D stack:0     pid:772   tgid:772   ppid:2      flags:0x00004000
May 03 16:05:15 srv02 kernel: Call Trace:
May 03 16:05:15 srv02 kernel:  <TASK>
May 03 16:05:15 srv02 kernel:  __schedule+0x401/0x15e0
May 03 16:05:15 srv02 kernel:  schedule+0x33/0x110
May 03 16:05:15 srv02 kernel:  schedule_timeout+0x95/0x170
May 03 16:05:15 srv02 kernel:  ? __pfx_process_timeout+0x10/0x10
May 03 16:05:15 srv02 kernel:  io_schedule_timeout+0x51/0x80
May 03 16:05:15 srv02 kernel:  __cv_timedwait_common+0x140/0x180 [spl]
May 03 16:05:15 srv02 kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
May 03 16:05:15 srv02 kernel:  __cv_timedwait_io+0x19/0x30 [spl]
May 03 16:05:15 srv02 kernel:  zio_wait+0x13a/0x2c0 [zfs]
May 03 16:05:15 srv02 kernel:  dsl_pool_sync+0xce/0x4e0 [zfs]
May 03 16:05:15 srv02 kernel:  spa_sync+0x578/0x1030 [zfs]
May 03 16:05:15 srv02 kernel:  ? spa_txg_history_init_io+0x120/0x130 [zfs]
May 03 16:05:15 srv02 kernel:  txg_sync_thread+0x1fd/0x390 [zfs]
May 03 16:05:15 srv02 kernel:  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
May 03 16:05:15 srv02 kernel:  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
May 03 16:05:15 srv02 kernel:  thread_generic_wrapper+0x5c/0x70 [spl]
May 03 16:05:15 srv02 kernel:  kthread+0xef/0x120
May 03 16:05:15 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 16:05:15 srv02 kernel:  ret_from_fork+0x44/0x70
May 03 16:05:15 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 16:05:15 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 03 16:05:15 srv02 kernel:  </TASK>
May 03 16:07:17 srv02 kernel: INFO: task txg_sync:772 blocked for more than 245 seconds.
May 03 16:07:17 srv02 kernel:       Tainted: P           O       6.8.4-2-pve #1
May 03 16:07:17 srv02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 03 16:07:17 srv02 kernel: task:txg_sync        state:D stack:0     pid:772   tgid:772   ppid:2      flags:0x00004000
May 03 16:07:17 srv02 kernel: Call Trace:
May 03 16:07:17 srv02 kernel:  <TASK>
May 03 16:07:17 srv02 kernel:  __schedule+0x401/0x15e0
May 03 16:07:17 srv02 kernel:  schedule+0x33/0x110
May 03 16:07:17 srv02 kernel:  schedule_timeout+0x95/0x170
May 03 16:07:17 srv02 kernel:  ? __pfx_process_timeout+0x10/0x10
May 03 16:07:17 srv02 kernel:  io_schedule_timeout+0x51/0x80
May 03 16:07:17 srv02 kernel:  __cv_timedwait_common+0x140/0x180 [spl]
May 03 16:07:17 srv02 kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
May 03 16:07:17 srv02 kernel:  __cv_timedwait_io+0x19/0x30 [spl]
May 03 16:07:17 srv02 kernel:  zio_wait+0x13a/0x2c0 [zfs]
May 03 16:07:17 srv02 kernel:  dsl_pool_sync+0xce/0x4e0 [zfs]
May 03 16:07:17 srv02 kernel:  spa_sync+0x578/0x1030 [zfs]
May 03 16:07:17 srv02 kernel:  ? spa_txg_history_init_io+0x120/0x130 [zfs]
May 03 16:07:17 srv02 kernel:  txg_sync_thread+0x1fd/0x390 [zfs]
May 03 16:07:17 srv02 kernel:  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
May 03 16:07:17 srv02 kernel:  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
May 03 16:07:17 srv02 kernel:  thread_generic_wrapper+0x5c/0x70 [spl]
May 03 16:07:17 srv02 kernel:  kthread+0xef/0x120
May 03 16:07:17 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 16:07:17 srv02 kernel:  ret_from_fork+0x44/0x70
May 03 16:07:17 srv02 kernel:  ? __pfx_kthread+0x10/0x10
May 03 16:07:17 srv02 kernel:  ret_from_fork_asm+0x1b/0x30
May 03 16:07:17 srv02 kernel:  </TASK>
May 03 16:08:27 srv02 zed[1921]: Missed 4 events
May 03 16:08:27 srv02 zed[1921]: Bumping queue length to 1024
May 03 16:08:27 srv02 zed[14807]: eid=6 class=deadman pool='rpool' vdev=nvme-eui.36344730571023550025385300000001-part3 size=4096 offset=139009306624 priority=3 err=0 flags=0x380080 bookmark=2192:1:0:1310682
May 03 16:08:27 srv02 zed[14815]: eid=8 class=deadman pool='rpool' vdev=nvme-eui.36344730571023550025385300000001-part3 size=4096 offset=139009302528 priority=3 err=0 flags=0x380080 bookmark=2192:1:0:1310698
May 03 16:08:27 srv02 zed[14811]: eid=7 class=deadman pool='rpool' vdev=nvme-eui.36344730571023550025385300000001-part3 size=73728 offset=139009245184 priority=3 err=0 flags=0x40080480 delay=1705706ms
[...]
 
Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

#7: Disable KSM -> 2024-05-03 disabled
--> #8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf
#9: go back to kernel 6.5 but leave all the modifications in place -> pending
#10: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#11: maybe try mitigations=off -> pending
 
Another crash, but now I was fast enough to also check the web UI. Logs are attached.
I will lower the RAM of some VMs a little bit to give the host more room.


After 2 hours it crashed again, but without any logs.
Now back to 6.5.13-5

Status update:

#1: Disabled ARC using primarycache=none -> still crashes
#2: Set aio=threads on all VMs -> still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs -> still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off -> still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid --> 2024-05-03 still crashes (Current revision: 0x00000122 <- Updated early from: 0x0000011f)

#7: Disable KSM -> 2024-05-03 disabled
#8: Deal with MSRs <-- options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf

--> #9: go back to kernel 6.5 but leave all the modifications in place
#10: maybe /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline -> pending
#11: maybe try mitigations=off -> pending
#12: maybe lower the RAM from DDR5-4400 to DDR5-4200
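
For #10, the active I/O scheduler is the bracketed entry in the sysfs file. A sketch for checking and switching it; nvme0n1 is a placeholder device name and active_sched is a helper name made up for illustration:

```shell
# active_sched: read a queue/scheduler line on stdin and print the active
# entry, i.e. the one shown in square brackets.
active_sched() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Check and switch (as root; the change does not persist across reboots):
#   active_sched < /sys/block/nvme0n1/queue/scheduler
#   echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
```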
 
Please check with
pcie_aspm=off intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1

Remove pcie_aspm.policy=performance split_lock_detect=off
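
After the reboot it is worth confirming which of these parameters the running kernel actually received. A minimal sketch that checks a command line for an exact parameter name, so that pcie_aspm and pcie_aspm.policy are not confused; the helper name has_kparam is made up for illustration:

```shell
# has_kparam NAME: succeed if the kernel command line on stdin contains
# the parameter NAME, either bare or as NAME=value (exact name match).
has_kparam() {
    awk -v p="$1" '
        { for (i = 1; i <= NF; i++) { split($i, kv, "="); if (kv[1] == p) found = 1 } }
        END { exit !found }'
}

# Example:
#   has_kparam pcie_aspm < /proc/cmdline && echo "pcie_aspm is set"
```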
C-states are already disabled in the BIOS, but I will check the rest.
Without pcie_aspm.policy=performance the network card was not stable. Disabling ASPM completely is also a good call.

Thank you.
 
