Hey everyone,
It has been about a week since I upgraded my homelab HP ML350 G6 from Proxmox 5.4 to 6.2. The upgrade itself went flawlessly, which was awesome, but the change in ZFS version seems to have introduced some nasty issues. As near as I can tell, the system becomes partly unresponsive after a ZFS scrub (every Sunday), or sometimes after a snapshot operation on a specific LXC container. It seems to hit a Plex Media Server install running in an Ubuntu 18.04 LXC container the hardest, but I'm not sure that is relevant.
I get messages such as this in my kernel log:
Code:
[425151.750185] general protection fault: 0000 [#1] SMP PTI
[425151.750420] CPU: 14 PID: 19490 Comm: Plex Media Serv Tainted: P IOE 5.4.44-1-pve #1
[425151.750677] Hardware name: HP ProLiant ML350 G6, BIOS D22 05/05/2011
[425151.750876] RIP: 0010:avl_walk+0x33/0x70 [zavl]
[425151.751014] Code: 10 b9 01 00 00 00 29 d1 4c 01 c6 48 89 e5 48 85 f6 74 48 48 63 d2 48 89 f7 48 8b 04 d6 48 85 c0 74 20 48 63 c9 eb 03 48 89 d0 <48> 8b 14 c8 48 85 d2 75 f4 48 89 c2 48 89 d0 5d 4c 29 c0 c3 39 f1
[425151.751604] RSP: 0018:ffffa8925fe47c80 EFLAGS: 00010282
[425151.751759] RAX: e089443875c085d0 RBX: ffff9b8b1c780940 RCX: 0000000000000000
[425151.751970] RDX: 0000000000000001 RSI: ffffffffc04b0478 RDI: ffffffffc04b0478
[425151.752180] RBP: ffffa8925fe47c80 R08: 0000000000000008 R09: ffff9b8cc3406f40
[425151.752398] R10: ffff9b893e776900 R11: 0000008000000000 R12: ffff9b893e776900
[425151.752639] R13: ffff9b8b1c780968 R14: 0000000000000000 R15: 0000000000000000
[425151.752849] FS: 00007f6ffd7fa700(0000) GS:ffff9b8cc39c0000(0000) knlGS:0000000000000000
[425151.753088] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[425151.753258] CR2: 00007f70bd0c0000 CR3: 0000000994900000 CR4: 00000000000006e0
[425151.753467] Call Trace:
[425151.753585] avl_nearest+0x2a/0x30 [zavl]
[425151.753780] zfs_rangelock_enter+0x405/0x580 [zfs]
[425151.753932] ? spl_kmem_zalloc+0xe9/0x140 [spl]
[425151.754071] ? spl_kmem_zalloc+0xe9/0x140 [spl]
[425151.754257] zfs_get_data+0x157/0x340 [zfs]
[425151.754432] zil_commit_impl+0x9ad/0xd90 [zfs]
[425151.754627] zil_commit+0x3d/0x60 [zfs]
[425151.754802] zfs_fsync+0x77/0xe0 [zfs]
[425151.754965] zpl_fsync+0x68/0xa0 [zfs]
[425151.755083] vfs_fsync_range+0x48/0x80
[425151.755197] ? __fget_light+0x59/0x70
[425151.755307] do_fsync+0x3d/0x70
[425151.755402] __x64_sys_fsync+0x14/0x20
[425151.755518] do_syscall_64+0x57/0x190
[425151.755632] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[425151.755826] RIP: 0033:0x7f70ba41cb07
[425151.755934] Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 04 f5 ff ff 89 df 89 c2 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 46 f5 ff ff 8b 44 24
[425151.756477] RSP: 002b:00007f6ffd7f89c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[425151.756700] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f70ba41cb07
[425151.756952] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000000000c
[425151.757162] RBP: 000000000295f178 R08: 0000000000000000 R09: 00007f70282cb270
[425151.757374] R10: 00007f7028029fc0 R11: 0000000000000293 R12: 0000000000000000
[425151.757583] R13: 000000000296da58 R14: 0000000000000002 R15: 0000000000000000
[425151.757794] Modules linked in: xt_recent(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) veth(E) ebtable_filter(E) ebtables(E) ip_set(E) ip6table_raw(E) iptable_raw(E) ip6table_filter(E) ip6_tables(E) mptctl(E) mptbase(E) binfmt_misc(E) iptable_filter(E) bpfilter(E) cpuid(E) softdog(E) nfnetlink_log(E) nfnetlink(E) radeon(E) ttm(E) drm_kms_helper(E) drm(E) intel_powerclamp(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) kvm_intel(E) sysfillrect(E) sysimgblt(E) kvm(E) usblp(E) ipmi_ssif(E) input_leds(E) pcspkr(E) irqbypass(E) i7core_edac(E) hpilo(E) serio_raw(E) ipmi_si(E) intel_cstate(E) ipmi_devintf(E) ipmi_msghandler(E) mac_hid(E) vhost_net(E) vhost(E) tap(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) coretemp(E) parport_pc(E) ppdev(E) lp(E) sunrpc(E) parport(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) zunicode(POE) zlua(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) btrfs(E)
[425151.757832] xor(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) hid_generic(E) usbmouse(E) usbkbd(E) usbhid(E) hid(E) gpio_ich(E) psmouse(E) lpc_ich(E) pata_acpi(E) ehci_pci(E) uhci_hcd(E) tg3(E) ehci_hcd(E) hpsa(E) scsi_transport_sas(E)
[425151.761103] ---[ end trace 1a7bd96ef6c5612f ]---
[425151.761247] RIP: 0010:avl_walk+0x33/0x70 [zavl]
[425151.761382] Code: 10 b9 01 00 00 00 29 d1 4c 01 c6 48 89 e5 48 85 f6 74 48 48 63 d2 48 89 f7 48 8b 04 d6 48 85 c0 74 20 48 63 c9 eb 03 48 89 d0 <48> 8b 14 c8 48 85 d2 75 f4 48 89 c2 48 89 d0 5d 4c 29 c0 c3 39 f1
[425151.761925] RSP: 0018:ffffa8925fe47c80 EFLAGS: 00010282
[425151.775112] RAX: e089443875c085d0 RBX: ffff9b8b1c780940 RCX: 0000000000000000
[425151.788577] RDX: 0000000000000001 RSI: ffffffffc04b0478 RDI: ffffffffc04b0478
[425151.802110] RBP: ffffa8925fe47c80 R08: 0000000000000008 R09: ffff9b8cc3406f40
[425151.815599] R10: ffff9b893e776900 R11: 0000008000000000 R12: ffff9b893e776900
[425151.828837] R13: ffff9b8b1c780968 R14: 0000000000000000 R15: 0000000000000000
[425151.841958] FS: 00007f6ffd7fa700(0000) GS:ffff9b8cc39c0000(0000) knlGS:0000000000000000
[425151.855228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[425151.868544] CR2: 00007f70bd0c0000 CR3: 0000000994900000 CR4: 00000000000006e0
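To make a trace like the one above easier to search for in bug trackers, here is a minimal Python sketch that pulls out the faulting kernel function and the reliable call-trace frames. The name `summarize_oops` is hypothetical, and the parsing assumes the dmesg layout shown above (a `RIP: 0010:` line followed by a `Call Trace:` section); frames prefixed with `?` are skipped because the kernel marks those as unreliable.

```python
import re

# Matches frames such as "zfs_fsync+0x77/0xe0 [zfs]" or "? __fget_light+0x59/0x70".
FRAME_RE = re.compile(
    r"(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)
# Leading dmesg timestamp, e.g. "[425151.750185] ".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def summarize_oops(log_text):
    """Return (fault, frames) where each entry is a (symbol, module) tuple.

    module is None for symbols that belong to the core kernel.
    """
    fault = None
    frames = []
    in_trace = False
    for raw in log_text.splitlines():
        line = TIMESTAMP_RE.sub("", raw.strip())
        if line.startswith("RIP: 0010:") and fault is None:
            # Kernel-mode instruction pointer: the function that faulted.
            m = FRAME_RE.search(line)
            if m:
                fault = (m.group(2), m.group(3))
        elif line.startswith("Call Trace:"):
            in_trace = True
        elif in_trace:
            m = FRAME_RE.match(line)
            if m is None:
                # First non-frame line ends the call trace section.
                in_trace = False
            elif m.group(1) is None:
                # Keep only frames without the "?" (unreliable) marker.
                frames.append((m.group(2), m.group(3)))
    return fault, frames
```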
After this, I can't shut down that container or otherwise kill the affected process (in this case, Plex Media Server); it is totally unresponsive. The rest of the system "seems" to work. However, to get that container working again, I essentially have to reboot the machine. Once all the essential services are shut down, the reboot hangs waiting for the hung container to stop (which it never does), so I am forced to use Alt-SysRq-B to force a hard reboot.
I have read some other threads that mention zfs_vdev_scheduler=none, which, as I understand it, was completely removed in 0.8.4 to solve similar issues. So this shouldn't be the case for me. Any suggestions?
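For anyone who wants to double-check the same thing on their own box, here is a minimal Python sketch that reports the loaded ZFS module version and whether the zfs_vdev_scheduler parameter is still exposed. It assumes the usual sysfs layout (`/sys/module/zfs/version` and `/sys/module/zfs/parameters/`); the function name `zfs_module_info` is made up for illustration.

```python
from pathlib import Path


def zfs_module_info(sys_module_dir="/sys/module/zfs"):
    """Return (version, scheduler) for the loaded zfs kernel module.

    Either value is None when the corresponding sysfs file is absent,
    e.g. when the module is not loaded or the parameter no longer exists.
    """
    base = Path(sys_module_dir)
    version_file = base / "version"
    sched_file = base / "parameters" / "zfs_vdev_scheduler"
    version = version_file.read_text().strip() if version_file.is_file() else None
    scheduler = sched_file.read_text().strip() if sched_file.is_file() else None
    return version, scheduler


if __name__ == "__main__":
    version, scheduler = zfs_module_info()
    print(f"zfs module version: {version}")
    if scheduler is None:
        print("zfs_vdev_scheduler is not exposed by this module")
    else:
        print(f"zfs_vdev_scheduler = {scheduler}")
```

If the parameter file is missing, the scheduler setting from those other threads cannot be what is in play here.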