I’m seeing recurring kernel panics (a couple of times a month) on a Proxmox node. The panic is always:
PANIC at dnode_sync.c:308:free_children()
This started recently after years of stability. Several VMs on this host automatically revert to a “clean_state” snapshot on a schedule (has been true for years), so there are frequent zfs rollback operations—but the problem is new.
I have other Proxmox nodes with similar workloads; only this node has ever hit this panic.
Environment
- Proxmox VE 8.x
- Kernel: 6.8.12-15-pve
- OpenZFS: 2.2.8-pve1
- RAM: ~768 GiB (ARC capped at ~16 GiB via zfs_arc_max)
- Storage: multiple ZFS pools on NVMe and SATA SSDs (some mirrors, some single-device vdevs). Regular scrubs show 0 errors.
- Networking: Broadcom NICs using bnxt_en; kernel logs also show RoCE/IB (bnxt_re) messages (see below).
- Panic occurs during regular operation (not only under heavy load).
- I often see a burst of VM TAP interfaces flapping (VM restarts or snapshot rollbacks) and repeated lines like:
bnxt_en … bnxt_re1: Failed to add GID: 0xffffff92
__ib_cache_gid_add: … error=-110
a minute or two before the ZFS assertion/panic.
- It’s not tied to one specific VM or a single pool; multiple pools are busy with frequent rollbacks.
- System is fully up to date. The issue persisted after updates.
- Since the last panic the node has been stable, but the same pattern repeated in previous weeks.
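To put a number on the "minute or two before" observation, here is a minimal Python sketch that measures, for each panic line, how long ago the last GID-add failure was logged. It assumes the kernel log was exported with ISO timestamps (for example with `journalctl -k -o short-iso`), so each line looks like `<timestamp> <host> kernel: <message>`. The function name and the line pattern are illustrative assumptions, not anything taken from the logs above.

```python
import re
from datetime import datetime

# Assumed line shape: "<ISO timestamp> <host> kernel: <message>"
LINE_RE = re.compile(r"^(\S+)\s+\S+\s+kernel:\s+(.*)$")
GID_FAIL = "Failed to add GID"
PANIC = "PANIC at dnode_sync.c"


def gaps_before_panic(lines):
    """Return, per panic line, the seconds since the latest GID-add failure.

    An entry is None when no GID-add failure was seen before that panic.
    Lines that do not match the assumed format are skipped.
    """
    last_gid = None
    gaps = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        try:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S%z")
        except ValueError:
            continue  # timestamp not in the assumed short-iso form
        msg = m.group(2)
        if GID_FAIL in msg:
            last_gid = ts
        elif PANIC in msg:
            gaps.append(None if last_gid is None
                        else (ts - last_gid).total_seconds())
    return gaps
```

Saving the previous boot's kernel log to a file (`journalctl -k -b -1 -o short-iso > kernel.log`) and passing the open file to `gaps_before_panic` gives one gap per panic. This only works if the panic line actually reached the persistent journal. A consistently short gap would support a correlation, though it would not prove causation.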
Representative log snippet
… bnxt_en … bnxt_re1: Failed to add GID: 0xffffff92
… __ib_cache_gid_add: … error=-110
… tap<vmid>i0: entered/left modes; vmbr0 port entered forwarding state
VERIFY(BP_GET_FILL(db->db_blkptr) == 0 || db->db_dirtycnt > 0) failed
PANIC at dnode_sync.c:308:free_children()
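A side note on the two bnxt_re lines: they appear to report the same failure. Read as a signed 32-bit integer, 0xffffff92 is -110, which matches error=-110, and on Linux errno 110 is ETIMEDOUT. So the GID add seems to be timing out rather than being rejected outright. That the hex field is the raw return code is an assumption, though the match is suggestive. A quick check in Python (the helper name is illustrative):

```python
import errno


def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


code = as_signed32(0xFFFFFF92)
print(code)  # -110
# errno numbering is platform-specific; on Linux, 110 maps to ETIMEDOUT.
print(errno.errorcode.get(-code))
```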
What I checked
- zpool status -P across all pools: No known data errors; scrubs clean.
- arc_summary shows ARC ≈16 GiB target; ARC looks healthy; no memory throttling.
- Frequent zfs rollback events due to scheduled VM snapshot resets (expected).
- No SMART/media errors reported by drives.
- NIC logs show recurring RoCE GID-add failures (bnxt_re); RoCE isn’t intentionally used.
- I updated to the latest available packages on October 10, and it crashed again on October 13.
Questions
- Has anyone hit the free_children() assertion on OpenZFS 2.2.x under heavy zfs rollback / clone / destroy churn (typical CI workloads)? Any known fixes or patches?
- Could the bnxt_re (RoCE) GID-add errors plausibly contribute (e.g., timing, memory pressure, IRQ storms), or is this just noise?
- Recommended next steps to narrow down root cause?
Any pointers or reports of similar experiences would be greatly appreciated. Thanks!