Unable to create, snapshot, or backup VMs or CTs - Kernel RIP on ZFS lz4 compress

I have been experiencing issues with one of the nodes in my cluster recently: I am unable to create, snapshot, back up, or restore VMs or CTs. The only "fix" is to hard-reboot the node. All the VMs and CTs then come back online and operations work fine for a while, but eventually everything hangs again. The running VMs and CTs keep working, but PVE-related operations just hang.

I see some zfs processes that appear to have been around for days. Manually running zfs create commands also hangs. The processes from May 18 are probably from my attempts to run tasks from the web UI; when nothing happened I stopped the tasks. The zfs create for rpool/data/test was run on the command line this morning (May 20) and it also hung.

Code:
root       97134   97130  0 May18 ?        00:00:00 zfs snapshot rpool/data/subvol-104-disk-0@__replicate_104-0_1684424700__
root       97191   97184  0 May18 ?        00:00:02 /usr/bin/perl /usr/sbin/pvesm import local-zfs:subvol-101-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1684424703__ -allow-rename 0 -base __replicate_101-0_1684423803__
root       97199   97191  0 May18 ?        00:00:00 zfs recv -F -- rpool/data/subvol-101-disk-0
root      511827       1  0 06:17 ?        00:00:00 zfs create -s -V 33554432k rpool/data/vm-108-disk-0
root      518060       1  0 06:53 ?        00:00:00 zfs create -s -V 33554432k rpool/data/vm-108-disk-0
root      518421       1  0 06:55 ?        00:00:00 zfs create -s -V 33554432k rpool/data/test
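
For anyone hitting something similar, a quick way to confirm these processes are actually stuck in the kernel (rather than just idle) is to check their state and kernel stack. This is only a sketch - the PID is the hung zfs create for rpool/data/test from the listing above:

Code:
ps -o pid,stat,wchan:32,args -p 518421
cat /proc/518421/stack

A STAT of D (uninterruptible sleep) together with a stack parked in a ZFS/SPL wait is what I would expect if the pool's transaction group processing has stalled.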

pvesm status shows the following:

Code:
Name             Type     Status           Total            Used       Available        %
local             dir     active      3862533504         4139904      3858393600    0.11%
local-zfs     zfspool     active      3864538180         6144536      3858393644    0.16%
nas-iso          cifs     active      7233540916         2072208      7231468708    0.03%

and zpool status shows no errors:

Code:
  pool: rpool
 state: ONLINE
config:

    NAME                                               STATE     READ WRITE CKSUM
    rpool                                              ONLINE       0     0     0
     nvme-eui.00000000000000000026b768606f4165-part3  ONLINE       0     0     0

errors: No known data errors
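
For completeness, zpool status only reports errors ZFS has already hit; a scrub actively re-reads and verifies everything on disk (though it only exercises the read/checksum path and would not catch a fault in the in-memory compression code):

Code:
zpool scrub rpool
zpool status -v rpool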

I found the following page fault in the logs that seems to be related to ZFS LZ4 compression. Has anybody else seen this? I'm on kernel `5.15.107-2-pve`.

Code:
May 18 08:35:18 pve-01-1244 kernel: BUG: unable to handle page fault for address: 0000000037333a37
May 18 08:35:18 pve-01-1244 kernel: #PF: supervisor read access in kernel mode
May 18 08:35:18 pve-01-1244 kernel: #PF: error_code(0x0000) - not-present page
May 18 08:35:18 pve-01-1244 kernel: PGD 0 P4D 0
May 18 08:35:18 pve-01-1244 kernel: Oops: 0000 [#1] SMP NOPTI
May 18 08:35:18 pve-01-1244 kernel: CPU: 9 PID: 95695 Comm: z_wr_iss Tainted: P           O      5.15.107-2-pve #1
May 18 08:35:18 pve-01-1244 kernel: Hardware name: Protectli VP4670/VP4670, BIOS 5.17 11/01/2022
May 18 08:35:18 pve-01-1244 kernel: RIP: 0010:lz4_compress_zfs+0x5d5/0x7b0 [zfs]
May 18 08:35:18 pve-01-1244 kernel: Code: 69 42 fe b1 79 37 9e 48 8d 4a fe 49 89 d0 48 29 d9 49 29 d8 c1 e8 14 41 89 4c 85 00 69 02 b1 79 37 9e c1 e8 14 49 8d 4>
May 18 08:35:18 pve-01-1244 kernel: RSP: 0018:ffffac8d28247ca0 EFLAGS: 00010a02
May 18 08:35:18 pve-01-1244 kernel: RAX: 0000000037333a37 RBX: ffffac8d083cf000 RCX: 00000000000134be
May 18 08:35:18 pve-01-1244 kernel: RDX: ffffac8d083e24c0 RSI: ffffac8d0d001000 RDI: ffffac8d083ef000
May 18 08:35:18 pve-01-1244 kernel: RBP: ffffac8d28247cf8 R08: 00000000000134c0 R09: ffffac8d083eeff4
May 18 08:35:18 pve-01-1244 kernel: R10: 37302d2037333a37 R11: ffffac8d083e1880 R12: 0000000000020000
May 18 08:35:18 pve-01-1244 kernel: R13: ffff9f1b59840000 R14: ffffac8d083e2456 R15: ffffac8d0cfe71f4
May 18 08:35:18 pve-01-1244 kernel: FS:  0000000000000000(0000) GS:ffff9f2a5e440000(0000) knlGS:0000000000000000
May 18 08:35:18 pve-01-1244 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 18 08:35:18 pve-01-1244 kernel: CR2: 0000000037333a37 CR3: 000000012a00c002 CR4: 00000000003726e0
May 18 08:35:18 pve-01-1244 kernel: Call Trace:
May 18 08:35:18 pve-01-1244 kernel:  <TASK>
May 18 08:35:18 pve-01-1244 kernel:  zio_compress_data+0xd2/0x120 [zfs]
May 18 08:35:18 pve-01-1244 kernel:  zio_write_compress+0x552/0xa10 [zfs]
May 18 08:35:18 pve-01-1244 kernel:  zio_execute+0x92/0x160 [zfs]
May 18 08:35:18 pve-01-1244 kernel:  taskq_thread+0x29c/0x4d0 [spl]
May 18 08:35:18 pve-01-1244 kernel:  ? wake_up_q+0x90/0x90
May 18 08:35:18 pve-01-1244 kernel:  ? zio_gang_tree_free+0x70/0x70 [zfs]
May 18 08:35:18 pve-01-1244 kernel:  ? taskq_thread_spawn+0x60/0x60 [spl]
May 18 08:35:18 pve-01-1244 kernel:  kthread+0x127/0x150
May 18 08:35:18 pve-01-1244 kernel:  ? set_kthread_struct+0x50/0x50
May 18 08:35:18 pve-01-1244 kernel:  ret_from_fork+0x1f/0x30
May 18 08:35:18 pve-01-1244 kernel:  </TASK>
May 18 08:35:18 pve-01-1244 kernel: Modules linked in: cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs veth ebtable_filter ebtables ip_set ip6table_raw ipta>
May 18 08:35:18 pve-01-1244 kernel:  intel_cstate pcspkr efi_pstore i2c_algo_bit ee1004 fb_sys_fops snd syscopyarea mei_me sysfillrect soundcore sysimgblt mei i>
May 18 08:35:18 pve-01-1244 kernel: CR2: 0000000037333a37
May 18 08:35:18 pve-01-1244 kernel: ---[ end trace d594fd094fb45adc ]---
May 18 08:35:18 pve-01-1244 kernel: RIP: 0010:lz4_compress_zfs+0x5d5/0x7b0 [zfs]
May 18 08:35:18 pve-01-1244 kernel: Code: 69 42 fe b1 79 37 9e 48 8d 4a fe 49 89 d0 48 29 d9 49 29 d8 c1 e8 14 41 89 4c 85 00 69 02 b1 79 37 9e c1 e8 14 49 8d 4>
May 18 08:35:18 pve-01-1244 kernel: RSP: 0018:ffffac8d28247ca0 EFLAGS: 00010a02
May 18 08:35:18 pve-01-1244 kernel: RAX: 0000000037333a37 RBX: ffffac8d083cf000 RCX: 00000000000134be
May 18 08:35:18 pve-01-1244 kernel: RDX: ffffac8d083e24c0 RSI: ffffac8d0d001000 RDI: ffffac8d083ef000
May 18 08:35:18 pve-01-1244 kernel: RBP: ffffac8d28247cf8 R08: 00000000000134c0 R09: ffffac8d083eeff4
May 18 08:35:18 pve-01-1244 kernel: R10: 37302d2037333a37 R11: ffffac8d083e1880 R12: 0000000000020000
May 18 08:35:18 pve-01-1244 kernel: R13: ffff9f1b59840000 R14: ffffac8d083e2456 R15: ffffac8d0cfe71f4
May 18 08:35:18 pve-01-1244 kernel: FS:  0000000000000000(0000) GS:ffff9f2a5e440000(0000) knlGS:0000000000000000
May 18 08:35:18 pve-01-1244 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 18 08:35:18 pve-01-1244 kernel: CR2: 0000000037333a37 CR3: 000000012a00c002 CR4: 00000000003726e0
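
Since the crash is inside lz4_compress_zfs, one stopgap I'm considering (untested, and it clearly doesn't address the underlying bug) is to move new writes off the LZ4 path by changing the compression property. Blocks that are already written keep whatever compression they were written with; only new data is affected:

Code:
# see what each dataset currently uses
zfs get -r -t filesystem,volume compression rpool
# have new writes use zstd instead (or compression=off to skip compression entirely)
zfs set compression=zstd rpool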
 
Updating the microcode didn't help - I got the same page fault again a few hours later, just as before.

I should mention that this is one node in a 3-node cluster, in case that's relevant.
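
In case anyone wants to compare, the microcode revision that is actually loaded can be read like this:

Code:
grep -m1 microcode /proc/cpuinfo
journalctl -k | grep -i microcode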
 
Have you tried asking upstream in the ZFS-on-Linux Git repository where the code is developed? Maybe they have some insight or know a fix?
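
If you do open an issue there, including the exact ZFS and kernel versions usually helps; something like this should cover it:

Code:
zfs version
uname -a
pveversion -v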
 
