Intermittent kernel panic: PANIC at dnode_sync.c:308:free_children() (Proxmox 8.4.14 / OpenZFS)

CacheMeOutside

Oct 22, 2025
I’m seeing recurring kernel panics (a couple of times a month) on a Proxmox node. The panic is always:

PANIC at dnode_sync.c:308:free_children()

This started recently after years of stability. Several VMs on this host automatically revert to a “clean_state” snapshot on a schedule (as they have for years), so zfs rollback operations are frequent, but the panic is new.

I have other Proxmox nodes with similar workloads; only this node has ever hit this panic.

Environment
  • Proxmox VE 8.4.14
  • Kernel: 6.8.12-15-pve
  • OpenZFS: 2.2.8-pve1
  • RAM: ~768 GiB (ARC capped at ~16 GiB via zfs_arc_max)
  • Storage: multiple ZFS pools on NVMe and SATA SSDs (some mirrors, some single-device vdevs). Regular scrubs show 0 errors.
  • Networking: Broadcom NICs using bnxt_en; kernel logs also show RoCE/IB (bnxt_re) messages (see below).
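
For anyone comparing ARC settings: zfs_arc_max is given in bytes. Below is a minimal sketch of the conversion, assuming a cap of exactly 16 GiB (my actual value is only approximately that) and the usual "options zfs ..." modprobe syntax. The helper names are made up for illustration.

```python
GIB = 2**30  # bytes per GiB


def arc_max_bytes(gib):
    """Convert an ARC cap in GiB to the byte value zfs_arc_max expects."""
    return int(gib * GIB)


def modprobe_option(gib):
    """Format a modprobe options line for the given cap (hypothetical helper)."""
    return f"options zfs zfs_arc_max={arc_max_bytes(gib)}"


# Assumed cap of exactly 16 GiB; the real value on this node is approximate.
print(modprobe_option(16))
```

The value currently in effect can be read from /sys/module/zfs/parameters/zfs_arc_max and compared with this number.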

Symptoms / timeline
  • Panic occurs during regular operation (not only under heavy load).
  • I often see a burst of VM TAP interfaces flapping (VM restarts or snapshot rollbacks) and repeated lines like:
    bnxt_en … bnxt_re1: Failed to add GID: 0xffffff92
    __ib_cache_gid_add: … error=-110
    a minute or two before the ZFS assertion/panic.
  • It’s not tied to one specific VM or a single pool; multiple pools are busy with frequent rollbacks.
  • System is fully up to date. The issue persisted after updates.
  • The node has been stable since the last panic, but the same pattern repeated in previous weeks.

Representative log snippet

… bnxt_en … bnxt_re1: Failed to add GID: 0xffffff92
… __ib_cache_gid_add: … error=-110
… tap<vmid>i0: entered/left modes; vmbr0 port entered forwarding state
VERIFY(BP_GET_FILL(db->db_blkptr) == 0 || db->db_dirtycnt > 0) failed
PANIC at dnode_sync.c:308:free_children()
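
To check whether the "a minute or two" gap between the GID-add failures and the panic holds across incidents, something like the following could be run over a saved kernel log. It is only a sketch. It assumes each line starts with an ISO timestamp such as 2025-01-01T00:00:00+0000 (the format journalctl prints with -o short-iso), and the function names are my own.

```python
from datetime import datetime

GID_MARKER = "Failed to add GID"        # text from the bnxt_re log lines
PANIC_MARKER = "PANIC at dnode_sync.c"  # text from the ZFS panic line


def parse_ts(line):
    """Parse the leading timestamp of a log line; return None if it is not one."""
    tokens = line.split(maxsplit=1)
    if not tokens:
        return None
    try:
        # Assumed format: 2025-01-01T00:00:00+0000
        return datetime.strptime(tokens[0], "%Y-%m-%dT%H:%M:%S%z")
    except ValueError:
        return None


def gid_to_panic_gaps(lines):
    """For each panic line, return the seconds since the most recent GID failure.

    An entry is None when no GID failure was seen before that panic.
    Lines without a parseable timestamp are skipped.
    """
    gaps = []
    last_gid = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if GID_MARKER in line:
            last_gid = ts
        elif PANIC_MARKER in line:
            gaps.append(None if last_gid is None else (ts - last_gid).total_seconds())
    return gaps
```

The function accepts any iterable of lines, so an open file works directly. A consistently short gap would only show correlation, not that the NIC errors cause the panic. Both could simply be triggered by the same VM restart or rollback burst.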

What I checked
  • zpool status -P across all pools: No known data errors; scrubs clean.
  • arc_summary shows ARC ≈16 GiB target; ARC looks healthy; no memory throttling.
  • Frequent zfs rollback events due to scheduled VM snapshot resets (expected).
  • No SMART/media errors reported by drives.
  • NIC logs show recurring RoCE GID-add failures (bnxt_re); RoCE isn’t intentionally used.
  • I updated to the latest available packages on October 10, and the node crashed again on October 13.

Questions for the community
  1. Has anyone hit the free_children() assertion on OpenZFS 2.2.x under heavy zfs rollback / clone / destroy churn (typical CI workloads)? Any known fixes or patches?
  2. Could the bnxt_re (RoCE) GID-add errors plausibly contribute (e.g., timing, memory pressure, IRQ storms), or is this just noise?
  3. Recommended next steps to narrow down root cause?

Any pointers or reports of similar experiences would be greatly appreciated. Thanks!