I have been experiencing daily GPFs (general protection faults) since upgrading a previously stable platform to 6.1. Once or twice per day, the machine goes into a series of soft lockups that can only be recovered from by power cycling the box. The problem appears to be ZFS related, as it is usually triggered immediately after a znapzend backup is initiated. However, multiple backups happen hourly and failures occur only intermittently. If I disable the backups, lockups still happen, but marginally less frequently.
I have set all the schedulers to 'none' but am still seeing exactly the same symptoms.
Currently on 6.1-8
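For anyone who wants to confirm the scheduler setting on a similar box, here is a minimal sketch. It assumes that "schedulers" above means the per-device block I/O schedulers exposed by Linux under /sys/block/<device>/queue/scheduler, where the active one is shown in square brackets. The function names are hypothetical and only for illustration.

```python
import glob
import os
import re
from typing import Dict, Optional


def active_scheduler(sysfs_text: str) -> Optional[str]:
    """Return the active scheduler from the contents of a sysfs scheduler file.

    The kernel marks the active scheduler with square brackets,
    for example "mq-deadline kyber [none]".
    """
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match:
        return match.group(1)
    # Assumption: a file holding a single bare token (for example "none")
    # means that token is the only choice and therefore the active one.
    tokens = sysfs_text.split()
    return tokens[0] if len(tokens) == 1 else None


def report(pattern: str = "/sys/block/*/queue/scheduler") -> Dict[str, Optional[str]]:
    """Map each block device name to its active scheduler (None if unreadable)."""
    result: Dict[str, Optional[str]] = {}
    for path in sorted(glob.glob(pattern)):
        # .../<device>/queue/scheduler -> <device>
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as handle:
                result[device] = active_scheduler(handle.read())
        except OSError:
            result[device] = None
    return result


if __name__ == "__main__":
    for device, scheduler in report().items():
        print(f"{device}: {scheduler}")
```

Any device that does not report none here would not match the configuration described above.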
The failures start like this and degrade over a short period of time until the journal gets trashed:
Mar 25 16:00:27 newton-iii znapzend[37524]: sending snapshots from wiz/vm-100-disk-2 to rpool/zbackups/newton-iii/wiz/vm-100-disk-2
Mar 25 16:00:27 newton-iii zed[43029]: eid=2296 class=history_event pool_guid=0x506DF3B09AC22638
Mar 25 16:00:27 newton-iii znapzend[14227]: mbuffer: warning: HOME environment variable not set - unable to find defaults file
Mar 25 16:00:27 newton-iii zed[43268]: eid=2297 class=history_event pool_guid=0x506DF3B09AC22638
Mar 25 16:00:27 newton-iii znapzend[37394]: cleaning up snapshots on rpool/zbackups/newton-iii/hum
Mar 25 16:00:27 newton-iii znapzend[37394]: cleaning up snapshots on rpool/zbackups/newton-iii/hum/vm-101-disk-1
Mar 25 16:00:27 newton-iii zed[43648]: eid=2298 class=history_event pool_guid=0x506DF3B09AC22638
Mar 25 16:00:27 newton-iii znapzend[37394]: cleaning up snapshots on rpool/zbackups/newton-iii/hum/vm-101-disk-2
Mar 25 16:00:27 newton-iii znapzend[37394]: sending snapshots from hum to root@delozier-v.bootstrap.je:sero/zbackups/newton-iii/hum
Mar 25 16:00:27 newton-iii kernel: general protection fault: 0000 [#1] SMP NOPTI
Mar 25 16:00:27 newton-iii kernel: CPU: 29 PID: 43652 Comm: receive_writer Tainted: P O 5.3.18-2-pve #1
Mar 25 16:00:27 newton-iii kernel: Hardware name: Supermicro Super Server/X10DRG-Q, BIOS 2.0a 08/29/2016
Mar 25 16:00:27 newton-iii kernel: RIP: 0010:__kmalloc_node+0x19f/0x320
Mar 25 16:00:27 newton-iii kernel: Code: 47 0b 04 0f 84 f4 fe ff ff 4c 89 ff e8 fa cc 01 00 49 89 c1 e9 e4 fe ff ff 41 8b 59 20 49 8b 39 48 8d 4a 01 4c 89 d0 4c 01 d3 <48> 33 1b 49 33 99 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84
Mar 25 16:00:27 newton-iii kernel: RSP: 0018:ffffb7bcdde77bf8 EFLAGS: 00010202
Mar 25 16:00:27 newton-iii kernel: RAX: 606a15974a93ccd9 RBX: 606a15974a93ccd9 RCX: 00000000000df02e
Mar 25 16:00:27 newton-iii kernel: RDX: 00000000000df02d RSI: 0000000000042c00 RDI: 000000000002f040
Mar 25 16:00:27 newton-iii kernel: RBP: ffffb7bcdde77c38 R08: ffff8b97ffc6f040 R09: ffff8b97ff407b80
Mar 25 16:00:27 newton-iii kernel: R10: 606a15974a93ccd9 R11: ffff8b7dcbdfeae0 R12: 0000000000042c00
Mar 25 16:00:27 newton-iii kernel: R13: 0000000000000001 R14: 00000000ffffffff R15: ffff8b97ff407b80
Mar 25 16:00:27 newton-iii kernel: FS: 0000000000000000(0000) GS:ffff8b97ffc40000(0000) knlGS:0000000000000000
Mar 25 16:00:27 newton-iii kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 16:00:27 newton-iii kernel: CR2: 00007f5ac65f99a0 CR3: 000000357e80a004 CR4: 00000000003626e0
Mar 25 16:00:27 newton-iii kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 25 16:00:27 newton-iii kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 25 16:00:27 newton-iii kernel: Call Trace:
Mar 25 16:00:27 newton-iii kernel: ? spl_kmem_alloc+0xec/0x140 [spl]
Mar 25 16:00:27 newton-iii kernel: spl_kmem_alloc+0xec/0x140 [spl]
Mar 25 16:00:27 newton-iii kernel: dbuf_dirty+0x107/0x830 [zfs]
Mar 25 16:00:27 newton-iii kernel: dmu_buf_will_dirty_impl+0x11a/0x130 [zfs]
Mar 25 16:00:27 newton-iii kernel: dmu_buf_will_dirty+0x16/0x20 [zfs]
Mar 25 16:00:27 newton-iii kernel: receive_object+0x56e/0xbc0 [zfs]
Mar 25 16:00:27 newton-iii kernel: ? dnode_rele+0x3b/0x40 [zfs]
Mar 25 16:00:27 newton-iii kernel: ? _cond_resched+0x19/0x30
Mar 25 16:00:27 newton-iii kernel: ? mutex_lock+0x12/0x30
Mar 25 16:00:27 newton-iii kernel: receive_writer_thread+0x201/0xb80 [zfs]
Mar 25 16:00:27 newton-iii kernel: ? set_curr_task_fair+0x2b/0x60
Mar 25 16:00:27 newton-iii kernel: ? spl_kmem_free+0x33/0x40 [spl]
Mar 25 16:00:27 newton-iii kernel: ? receive_read_prefetch+0x140/0x140 [zfs]
Mar 25 16:00:27 newton-iii kernel: thread_generic_wrapper+0x74/0x90 [spl]
Mar 25 16:00:27 newton-iii kernel: ? receive_read_prefetch+0x140/0x140 [zfs]
Mar 25 16:00:27 newton-iii kernel: ? thread_generic_wrapper+0x74/0x90 [spl]
Mar 25 16:00:27 newton-iii kernel: kthread+0x120/0x140
Mar 25 16:00:27 newton-iii kernel: ? __thread_exit+0x20/0x20 [spl]
Mar 25 16:00:27 newton-iii kernel: ? __kthread_parkme+0x70/0x70
Mar 25 16:00:27 newton-iii kernel: ret_from_fork+0x35/0x40
Mar 25 16:00:27 newton-iii kernel: Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink ast snd_hda_codec_realtek drm_vram_helper intel_rapl_msr snd_hda_codec_generic ledtrig_audio intel_rapl_common ttm sb_edac snd_hda_intel x86_pkg_temp_thermal snd_intel_nhlt intel_powerclamp drm_kms_helper snd_hda_codec uvcvideo coretemp kvm_intel snd_usb_audio videobuf2_vmalloc snd_hda_core videobuf2_memops snd_
Mar 25 16:00:27 newton-iii kernel: mac_hid ipmi_msghandler acpi_power_meter acpi_pad vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c mlx4_ib ib_uverbs ib_core mlx4_en hid_logitech_hidpp hid_apple hid_logitech_dj usbmouse usbkbd hid_generic usbhid hi
Mar 25 16:00:27 newton-iii kernel: ---[ end trace f24480b659ccd02e ]---
Mar 25 16:00:27 newton-iii kernel: RIP: 0010:__kmalloc_node+0x19f/0x320
Mar 25 16:00:27 newton-iii kernel: Code: 47 0b 04 0f 84 f4 fe ff ff 4c 89 ff e8 fa cc 01 00 49 89 c1 e9 e4 fe ff ff 41 8b 59 20 49 8b 39 48 8d 4a 01 4c 89 d0 4c 01 d3 <48> 33 1b 49 33 99 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84
Mar 25 16:00:27 newton-iii kernel: RSP: 0018:ffffb7bcdde77bf8 EFLAGS: 00010202
Mar 25 16:00:27 newton-iii kernel: RAX: 606a15974a93ccd9 RBX: 606a15974a93ccd9 RCX: 00000000000df02e
Mar 25 16:00:27 newton-iii kernel: RDX: 00000000000df02d RSI: 0000000000042c00 RDI: 000000000002f040
Mar 25 16:00:27 newton-iii kernel: RBP: ffffb7bcdde77c38 R08: ffff8b97ffc6f040 R09: ffff8b97ff407b80
Mar 25 16:00:27 newton-iii kernel: R10: 606a15974a93ccd9 R11: ffff8b7dcbdfeae0 R12: 0000000000042c00
Mar 25 16:00:27 newton-iii kernel: R13: 0000000000000001 R14: 00000000ffffffff R15: ffff8b97ff407b80
Mar 25 16:00:27 newton-iii kernel: FS: 0000000000000000(0000) GS:ffff8b97ffc40000(0000) knlGS:0000000000000000
Mar 25 16:00:27 newton-iii kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 16:00:27 newton-iii kernel: CR2: 00007f5ac65f99a0 CR3: 000000357e80a004 CR4: 00000000003626e0
Mar 25 16:00:27 newton-iii kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 25 16:00:27 newton-iii kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 25 16:00:27 newton-iii zed[43705]: eid=2299 class=history_event pool_guid=0x506DF3B09AC22638
Mar 25 16:00:28 newton-iii zed[43740]: eid=2300 class=history_event pool_guid=0x506DF3B09AC22638
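To check how often a fault really follows a znapzend action, the journal can be scanned mechanically. The following is a rough sketch, assuming the journal has been exported as plain text in the same syslog-style format as the excerpt above. The regular expression and the function name are hypothetical and written only for this illustration.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches lines such as:
#   Mar 25 16:00:27 newton-iii znapzend[37394]: cleaning up snapshots on ...
#   Mar 25 16:00:27 newton-iii kernel: general protection fault: 0000 [#1] SMP NOPTI
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s[\d:]{8})\s(\S+)\s([^\[:\s]+)(?:\[\d+\])?:\s(.*)$"
)

Event = Tuple[str, str]  # (timestamp, message)


def faults_with_context(lines: Iterable[str]) -> List[Tuple[str, Optional[Event]]]:
    """Pair each kernel GPF line with the most recent znapzend line before it."""
    last_znapzend: Optional[Event] = None
    found: List[Tuple[str, Optional[Event]]] = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip lines that do not look like syslog entries
        stamp, _host, unit, message = match.groups()
        if unit == "znapzend":
            last_znapzend = (stamp, message)
        elif unit == "kernel" and "general protection fault" in message:
            found.append((stamp, last_znapzend))
    return found


if __name__ == "__main__":
    # Two lines taken from the log excerpt above.
    sample = [
        "Mar 25 16:00:27 newton-iii znapzend[37394]: sending snapshots from hum "
        "to root@delozier-v.bootstrap.je:sero/zbackups/newton-iii/hum",
        "Mar 25 16:00:27 newton-iii kernel: general protection fault: 0000 [#1] SMP NOPTI",
    ]
    for stamp, previous in faults_with_context(sample):
        print(stamp, "preceded by", previous)
```

Run over several days of journal output, this would show whether each fault has a znapzend action shortly before it, or whether some faults occur with no backup activity nearby.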