[SOLVED] ZFS causing NULL pointer dereference, freezing LXC container

Derock

New Member
Dec 5, 2023
I'm not entirely sure how to reproduce this, but it has occurred twice now. Each time, one LXC container freezes: no new process that requires disk access can be spawned, and the container can neither be shut down gracefully (TASK ERROR: container did not stop) nor force-stopped (the stop task is still running after 15 minutes).

Running proxmox-kernel-6.8 (6.8.8-3) with zfs-2.2.4-pve1 and zfs-kmod-2.2.4-pve1. I have tried reinstalling the kernel and regenerating the initramfs, and I have verified the checksums of the installed apt packages with the debsums tool.
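For the record, a package-integrity check like the one described can be run with debsums; the package names below are examples and may differ on your system:

```shell
apt install debsums                 # if not already installed
# -s: silent mode, report only files whose checksum does not match
debsums -s zfsutils-linux zfs-zed
# or check every installed package (slower):
debsums -s
```

No output from `debsums -s` means all checked files match their packaged checksums.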

Code:
root@goliath:~# zpool status
  pool: main
 state: ONLINE
  scan: scrub repaired 0B in 13:22:46 with 0 errors on Sun Jul 14 13:46:47 2024
config:

        NAME                                 STATE     READ WRITE CKSUM
        main                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST8000VN004-3CP101_WWZ3M5WD  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_WWZ3GKMM  ONLINE       0     0     0
            ata-ST8000VN004-3CP101_WWZ3LFGQ  ONLINE       0     0     0

errors: No known data errors
root@goliath:~# pvesm status
Name              Type     Status           Total            Used       Available        %
local              dir     active        98497780        15735436        77712796   15.98%
zfs-images         dir     active      6621839616        10294912      6611544704    0.16%
zfs-vms        zfspool     active      7587128145       975583337      6611544808   12.86%

Relevant dmesg entry:
Code:
[ 1710.522790] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1710.522811] #PF: supervisor read access in kernel mode
[ 1710.522822] #PF: error_code(0x0000) - not-present page
[ 1710.522833] PGD 1134f2067 P4D 1134f2067 PUD 13b330067 PMD 0
[ 1710.522848] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1710.522861] CPU: 7 PID: 1358 Comm: txg_sync Tainted: P           O       6.8.8-3-pve #1
[ 1710.522875] Hardware name: Gigabyte Technology Co., Ltd. B550I AORUS PRO AX/B550I AORUS PRO AX, BIOS FCc 09/20/2023
[ 1710.522892] RIP: 0010:arc_release+0x16/0x570 [zfs]
[ 1710.523072] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 40 <48> 8b 1f 48 81 7b 60 80 b8 ba c0 0f 84 fc 03 00 00 48 8b 33 48 8b
[ 1710.523099] RSP: 0018:ffffaba352b8fa08 EFLAGS: 00010282
[ 1710.523111] RAX: 0000000000000002 RBX: ffff931c6d06ef78 RCX: 0000000000000000
[ 1710.523123] RDX: dead000000000100 RSI: ffff932671fa8c40 RDI: 0000000000000000
[ 1710.523135] RBP: ffffaba352b8fa70 R08: 0000000000000000 R09: 0000000000000000
[ 1710.523147] R10: 0000000000000000 R11: 0000000000000000 R12: ffff932694301c00
[ 1710.523160] R13: ffff9325accd8480 R14: ffff932671fa8c40 R15: 0000000000000000
[ 1710.523172] FS:  0000000000000000(0000) GS:ffff932b3e580000(0000) knlGS:0000000000000000
[ 1710.523185] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1710.523197] CR2: 0000000000000000 CR3: 000000010650e000 CR4: 0000000000350ef0
[ 1710.523210] Call Trace:
[ 1710.523218]  <TASK>
[ 1710.523226]  ? show_regs+0x6d/0x80
[ 1710.523238]  ? __die+0x24/0x80
[ 1710.523247]  ? page_fault_oops+0x176/0x500
[ 1710.523484]  ? srso_return_thunk+0x5/0x5f
[ 1710.523702]  ? do_user_addr_fault+0x2f9/0x6b0
[ 1710.523917]  ? exc_page_fault+0x83/0x1b0
[ 1710.524125]  ? asm_exc_page_fault+0x27/0x30
[ 1710.524332]  ? arc_release+0x16/0x570 [zfs]
[ 1710.524689]  ? spl_kvmalloc+0x84/0xc0 [spl]
[ 1710.524900]  ? srso_return_thunk+0x5/0x5f
[ 1710.525099]  ? spl_kvmalloc+0x84/0xc0 [spl]
[ 1710.525308]  ? srso_return_thunk+0x5/0x5f
[ 1710.525507]  dbuf_dirty+0x366/0x930 [zfs]
[ 1710.525867]  dmu_buf_will_dirty_impl+0xd0/0x240 [zfs]
[ 1710.526217]  dmu_buf_will_dirty+0x16/0x30 [zfs]
[ 1710.526557]  dmu_write_impl+0x48/0xf0 [zfs]
[ 1710.526902]  dmu_write+0xdc/0x190 [zfs]
[ 1710.527243]  space_map_write+0x15a/0x9d0 [zfs]
[ 1710.527590]  ? srso_return_thunk+0x5/0x5f
[ 1710.527770]  ? srso_return_thunk+0x5/0x5f
[ 1710.527945]  ? space_map_estimate_optimal_size+0x170/0x1d0 [zfs]
[ 1710.528288]  ? metaslab_should_condense+0xaa/0x100 [zfs]
[ 1710.528631]  metaslab_flush+0xde/0x370 [zfs]
[ 1710.528965]  ? metaslab_unflushed_bump+0x123/0x170 [zfs]
[ 1710.529296]  spa_flush_metaslabs+0x1d1/0x450 [zfs]
[ 1710.529620]  spa_sync+0x624/0x1030 [zfs]
[ 1710.529942]  ? srso_return_thunk+0x5/0x5f
[ 1710.530094]  ? spa_txg_history_init_io+0x120/0x130 [zfs]
[ 1710.530411]  txg_sync_thread+0x207/0x3a0 [zfs]
[ 1710.530726]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 1710.531043]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 1710.531202]  thread_generic_wrapper+0x5f/0x70 [spl]
[ 1710.531358]  kthread+0xf2/0x120
[ 1710.531502]  ? __pfx_kthread+0x10/0x10
[ 1710.531642]  ret_from_fork+0x47/0x70
[ 1710.531793]  ? __pfx_kthread+0x10/0x10
[ 1710.531938]  ret_from_fork_asm+0x1b/0x30
[ 1710.532087]  </TASK>
[ 1710.532228] Modules linked in: tcp_diag inet_diag nf_conntrack_netlink xt_nat xt_conntrack xfrm_user xfrm_algo xt_addrtype act_police cls_basic sch_ingress sch_htb nft_compat nft_chain_nat nfsd auth_rpcgss nfs_acl lockd grace xt_MASQUERADE xt_tcpudp xt_mark veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw overlay nf_tables ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common rtw89_8852ce rtw89_8852c edac_mce_amd rtw89_pci snd_hda_codec_realtek rtw89_core snd_hda_codec_generic kvm_amd snd_hda_codec_hdmi kvm snd_hda_intel irqbypass snd_intel_dspcfg crct10dif_pclmul snd_intel_sdw_acpi polyval_clmulni polyval_generic mac80211 btusb ghash_clmulni_intel btrtl snd_hda_codec sha256_ssse3 btintel sha1_ssse3 aesni_intel btbcm snd_hda_core btmtk snd_hwdep crypto_simd cryptd snd_pcm bluetooth snd_timer cfg80211 snd ecdh_generic ccp libarc4 soundcore ecc rapl
[ 1710.532314]  wmi_bmof gigabyte_wmi pcspkr k10temp mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbhid hid xhci_pci xhci_pci_renesas crc32_pclmul r8169 ahci xhci_hcd i2c_piix4 realtek libahci wmi gpio_amdpt
[ 1710.535033] CR2: 0000000000000000
[ 1710.535246] ---[ end trace 0000000000000000 ]---
[ 1710.757742] RIP: 0010:arc_release+0x16/0x570 [zfs]
[ 1710.758270] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 40 <48> 8b 1f 48 81 7b 60 80 b8 ba c0 0f 84 fc 03 00 00 48 8b 33 48 8b
[ 1710.758980] RSP: 0018:ffffaba352b8fa08 EFLAGS: 00010282
[ 1710.759221] RAX: 0000000000000002 RBX: ffff931c6d06ef78 RCX: 0000000000000000
[ 1710.759464] RDX: dead000000000100 RSI: ffff932671fa8c40 RDI: 0000000000000000
[ 1710.759720] RBP: ffffaba352b8fa70 R08: 0000000000000000 R09: 0000000000000000
[ 1710.759988] R10: 0000000000000000 R11: 0000000000000000 R12: ffff932694301c00
[ 1710.760229] R13: ffff9325accd8480 R14: ffff932671fa8c40 R15: 0000000000000000
[ 1710.760479] FS:  0000000000000000(0000) GS:ffff932b3e580000(0000) knlGS:0000000000000000
[ 1710.760727] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1710.760977] CR2: 0000000000000000 CR3: 000000010650e000 CR4: 0000000000350ef0

My preliminary googling shows that `txg_sync` (the process in context during the fault) is the ZFS thread that flushes transaction groups to disk, but I do not believe I have a failing disk. All my drives are plugged directly into the motherboard and, according to smartctl, report no reallocated sectors or uncorrectable errors. They are Seagate IronWolf drives less than a year old.
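The SMART attributes can be checked per drive roughly like this (the device names are placeholders for the three pool members; substitute your own):

```shell
# Print the relevant SMART attributes for each pool member
for dev in /dev/sda /dev/sdb /dev/sdc; do
  echo "== $dev =="
  smartctl -A "$dev" | grep -Ei 'reallocated|pending|uncorrect'
done
```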

Would love to have some pointers on how I can debug this issue.

edit: it seems this NULL pointer dereference hangs ZFS entirely; any process that attempts to write to the ZFS mount will hang. Should I open an issue on the ZFS project's GitHub?
 
Just an idea because you use AMD hardware:

The first thing I would try is scrubbing the pool to verify it is still fine. I had a problem with a raidz2 pool on AMD hardware whenever I created high write loads. It turned out to be timeouts in the SATA NCQ handling: the queue apparently was never fully drained. Essentially, if you fill up a 16-entry queue, you get an interrupt telling you that a few entries have been processed; if you then insert new requests, they can be reordered again, and some requests may "never" get processed. After a short timeout, that is counted as an error.
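You can inspect (and temporarily reduce) the NCQ queue depth per disk at runtime without rebooting; `sda` below is just an example device:

```shell
# Current queue depth: values > 1 mean NCQ is in use
cat /sys/block/sda/device/queue_depth
# Disable NCQ for this disk until the next reboot
echo 1 > /sys/block/sda/device/queue_depth
```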

Alas, you cannot really see this because of a poor implementation of the SATA-to-SCSI translation layer, so the real cause never shows up in the logs.

The only remedy I could find was to disable NCQ by adding "libata.force=noncq" to the kernel command line. I doubt it has a large impact on performance, because ZFS orders the requests anyway.
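A minimal sketch of that change for a GRUB-booted system (the file path is the Debian/Proxmox default; systems using systemd-boot edit /etc/kernel/cmdline instead):

```shell
# Append the option to the existing kernel command line, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=noncq"
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 libata.force=noncq"/' /etc/default/grub
update-grub    # regenerate the boot configuration
reboot         # the parameter takes effect on the next boot
```

After rebooting, `cat /proc/cmdline` should show the new parameter.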

I never had this problem on Intel hardware, BTW. And it is not a matter of the SATA controller, I tried several chipsets - it literally took me weeks to find this.
 
I did recently introduce a new high write load, but it isn't writing to the pool in question; it's writing to a different HDD mounted on the system. It does, however, perform a lot of reads from the pool.

I've gone ahead and added `libata.force=noncq` to my GRUB config to disable NCQ, and I also re-routed the SATA cables to eliminate any loose connectors or unintended stress on a cable or connector. I ran two scrubs: the first found a few corrupted files (likely due to the random freezes interrupting writes), but after removing those files the second scrub finished without any issues, and the system has been stable since.
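For anyone following along, the scrub-and-verify cycle looks like this (pool name `main` as in the `zpool status` output above):

```shell
zpool scrub main        # start a scrub of the pool
zpool status -v main    # -v lists any files with unrecoverable errors
# After deleting or restoring the affected files, clear and re-scrub to confirm:
zpool clear main
zpool scrub main
```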

Thanks for your help!
 
