I have recently been having issues with one of the nodes in my cluster: I am unable to create, snapshot, back up, or restore VMs or CTs on it. The only "fix" is to hard reboot the node. All the VMs and CTs come back online and operations work fine for a while, but eventually everything hangs again. The running VMs and CTs continue to work, but PVE-related operations just hang.
I see some zfs processes that appear to have been around for days. Manually running zfs create commands also seems to hang. The processes from May 18 are probably from me trying to execute tasks from the WebUI; when nothing happened, I stopped the tasks. The zfs create for rpool/data/test was run on the command line this morning (May 20) and it also hung.
Code:
root 97134 97130 0 May18 ? 00:00:00 zfs snapshot rpool/data/subvol-104-disk-0@__replicate_104-0_1684424700__
root 97191 97184 0 May18 ? 00:00:02 /usr/bin/perl /usr/sbin/pvesm import local-zfs:subvol-101-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1684424703__ -allow-rename 0 -base __replicate_101-0_1684423803__
root 97199 97191 0 May18 ? 00:00:00 zfs recv -F -- rpool/data/subvol-101-disk-0
root 511827 1 0 06:17 ? 00:00:00 zfs create -s -V 33554432k rpool/data/vm-108-disk-0
root 518060 1 0 06:53 ? 00:00:00 zfs create -s -V 33554432k rpool/data/vm-108-disk-0
root 518421 1 0 06:55 ? 00:00:00 zfs create -s -V 33554432k rpool/data/test
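For anyone wanting to check the same thing on their own node: commands stuck inside the kernel normally sit in uninterruptible sleep (state `D` in `ps`). Below is a minimal sketch for listing such tasks. The function names `filter_d_state` and `list_blocked_tasks` are made up for illustration, and it assumes the hung zfs commands are in that state, which the listing above does not confirm.

```shell
# Keep only lines whose second field (process state) starts with "D",
# i.e. uninterruptible sleep. Expects "PID STAT ELAPSED COMMAND..." on stdin,
# without a header row.
filter_d_state() {
    awk '$2 ~ /^D/'
}

# Hypothetical helper: list every task currently in state D.
# The trailing "=" on each column name suppresses the ps header line.
list_blocked_tasks() {
    ps -eo pid=,stat=,etime=,args= | filter_d_state
}
```

Running `list_blocked_tasks` prints one line per blocked task. For a given PID, `cat /proc/<PID>/stack` (as root) may show where in the kernel the task is waiting, if the running kernel exposes that file.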
pvesm status shows the following:
Code:
Name Type Status Total Used Available %
local dir active 3862533504 4139904 3858393600 0.11%
local-zfs zfspool active 3864538180 6144536 3858393644 0.16%
nas-iso cifs active 7233540916 2072208 7231468708 0.03%
and zpool status shows no errors:
Code:
pool: rpool
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme-eui.00000000000000000026b768606f4165-part3 ONLINE 0 0 0
errors: No known data errors
I found the following page fault in the logs that seems to be related to ZFS LZ4 compression. Has anybody else seen this? I'm on kernel `5.15.107-2-pve`.
Code:
May 18 08:35:18 pve-01-1244 kernel: BUG: unable to handle page fault for address: 0000000037333a37
May 18 08:35:18 pve-01-1244 kernel: #PF: supervisor read access in kernel mode
May 18 08:35:18 pve-01-1244 kernel: #PF: error_code(0x0000) - not-present page
May 18 08:35:18 pve-01-1244 kernel: PGD 0 P4D 0
May 18 08:35:18 pve-01-1244 kernel: Oops: 0000 [#1] SMP NOPTI
May 18 08:35:18 pve-01-1244 kernel: CPU: 9 PID: 95695 Comm: z_wr_iss Tainted: P O 5.15.107-2-pve #1
May 18 08:35:18 pve-01-1244 kernel: Hardware name: Protectli VP4670/VP4670, BIOS 5.17 11/01/2022
May 18 08:35:18 pve-01-1244 kernel: RIP: 0010:lz4_compress_zfs+0x5d5/0x7b0 [zfs]
May 18 08:35:18 pve-01-1244 kernel: Code: 69 42 fe b1 79 37 9e 48 8d 4a fe 49 89 d0 48 29 d9 49 29 d8 c1 e8 14 41 89 4c 85 00 69 02 b1 79 37 9e c1 e8 14 49 8d 4>
May 18 08:35:18 pve-01-1244 kernel: RSP: 0018:ffffac8d28247ca0 EFLAGS: 00010a02
May 18 08:35:18 pve-01-1244 kernel: RAX: 0000000037333a37 RBX: ffffac8d083cf000 RCX: 00000000000134be
May 18 08:35:18 pve-01-1244 kernel: RDX: ffffac8d083e24c0 RSI: ffffac8d0d001000 RDI: ffffac8d083ef000
May 18 08:35:18 pve-01-1244 kernel: RBP: ffffac8d28247cf8 R08: 00000000000134c0 R09: ffffac8d083eeff4
May 18 08:35:18 pve-01-1244 kernel: R10: 37302d2037333a37 R11: ffffac8d083e1880 R12: 0000000000020000
May 18 08:35:18 pve-01-1244 kernel: R13: ffff9f1b59840000 R14: ffffac8d083e2456 R15: ffffac8d0cfe71f4
May 18 08:35:18 pve-01-1244 kernel: FS: 0000000000000000(0000) GS:ffff9f2a5e440000(0000) knlGS:0000000000000000
May 18 08:35:18 pve-01-1244 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 18 08:35:18 pve-01-1244 kernel: CR2: 0000000037333a37 CR3: 000000012a00c002 CR4: 00000000003726e0
May 18 08:35:18 pve-01-1244 kernel: Call Trace:
May 18 08:35:18 pve-01-1244 kernel: <TASK>
May 18 08:35:18 pve-01-1244 kernel: zio_compress_data+0xd2/0x120 [zfs]
May 18 08:35:18 pve-01-1244 kernel: zio_write_compress+0x552/0xa10 [zfs]
May 18 08:35:18 pve-01-1244 kernel: zio_execute+0x92/0x160 [zfs]
May 18 08:35:18 pve-01-1244 kernel: taskq_thread+0x29c/0x4d0 [spl]
May 18 08:35:18 pve-01-1244 kernel: ? wake_up_q+0x90/0x90
May 18 08:35:18 pve-01-1244 kernel: ? zio_gang_tree_free+0x70/0x70 [zfs]
May 18 08:35:18 pve-01-1244 kernel: ? taskq_thread_spawn+0x60/0x60 [spl]
May 18 08:35:18 pve-01-1244 kernel: kthread+0x127/0x150
May 18 08:35:18 pve-01-1244 kernel: ? set_kthread_struct+0x50/0x50
May 18 08:35:18 pve-01-1244 kernel: ret_from_fork+0x1f/0x30
May 18 08:35:18 pve-01-1244 kernel: </TASK>
May 18 08:35:18 pve-01-1244 kernel: Modules linked in: cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs veth ebtable_filter ebtables ip_set ip6table_raw ipta>
May 18 08:35:18 pve-01-1244 kernel: intel_cstate pcspkr efi_pstore i2c_algo_bit ee1004 fb_sys_fops snd syscopyarea mei_me sysfillrect soundcore sysimgblt mei i>
May 18 08:35:18 pve-01-1244 kernel: CR2: 0000000037333a37
May 18 08:35:18 pve-01-1244 kernel: ---[ end trace d594fd094fb45adc ]---
May 18 08:35:18 pve-01-1244 kernel: RIP: 0010:lz4_compress_zfs+0x5d5/0x7b0 [zfs]
May 18 08:35:18 pve-01-1244 kernel: Code: 69 42 fe b1 79 37 9e 48 8d 4a fe 49 89 d0 48 29 d9 49 29 d8 c1 e8 14 41 89 4c 85 00 69 02 b1 79 37 9e c1 e8 14 49 8d 4>
May 18 08:35:18 pve-01-1244 kernel: RSP: 0018:ffffac8d28247ca0 EFLAGS: 00010a02
May 18 08:35:18 pve-01-1244 kernel: RAX: 0000000037333a37 RBX: ffffac8d083cf000 RCX: 00000000000134be
May 18 08:35:18 pve-01-1244 kernel: RDX: ffffac8d083e24c0 RSI: ffffac8d0d001000 RDI: ffffac8d083ef000
May 18 08:35:18 pve-01-1244 kernel: RBP: ffffac8d28247cf8 R08: 00000000000134c0 R09: ffffac8d083eeff4
May 18 08:35:18 pve-01-1244 kernel: R10: 37302d2037333a37 R11: ffffac8d083e1880 R12: 0000000000020000
May 18 08:35:18 pve-01-1244 kernel: R13: ffff9f1b59840000 R14: ffffac8d083e2456 R15: ffffac8d0cfe71f4
May 18 08:35:18 pve-01-1244 kernel: FS: 0000000000000000(0000) GS:ffff9f2a5e440000(0000) knlGS:0000000000000000
May 18 08:35:18 pve-01-1244 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 18 08:35:18 pve-01-1244 kernel: CR2: 0000000037333a37 CR3: 000000012a00c002 CR4: 00000000003726e0
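One detail in the register dump may be worth a closer look: the faulting address in RAX/CR2 (`0000000037333a37`) and the value in R10 (`37302d2037333a37`) consist entirely of printable ASCII bytes. The sketch below decodes a register value as little-endian ASCII to make that visible. The helper name `reg_to_ascii` is invented for illustration. Reading the result as "data ended up where a pointer or offset was expected" is only a guess, not a diagnosis.

```shell
# Decode a hex register value (as printed in a kernel oops) into ASCII,
# treating it as little-endian: the last hex pair is the first byte in memory.
# Bytes outside the printable range 32..126 are shown as ".".
reg_to_ascii() {
    hex=$1
    # Reject empty input, non-hex characters, and odd-length strings.
    case $hex in ''|*[!0-9a-fA-F]*) return 1 ;; esac
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    out=""
    while [ -n "$hex" ]; do
        byte=${hex%"${hex#??}"}   # first two hex digits
        hex=${hex#??}             # remainder of the string
        n=$((0x$byte))
        if [ "$n" -ge 32 ] && [ "$n" -le 126 ]; then
            ch=$(printf "\\$(printf '%03o' "$n")")
        else
            ch=.
        fi
        out=$ch$out               # prepend to reverse the byte order
    done
    printf '%s\n' "$out"
}
```

With the values from the trace, `reg_to_ascii 37302d2037333a37` prints `7:37 -07` and `reg_to_ascii 0000000037333a37` prints `7:37....`. Both look like fragments of text rather than kernel addresses.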