PANIC at zfs_quota.c:88:zpl_get_file_info()

Hi,

I had some file system errors some time ago (see messages here). Since there was no clear evidence, I decided to replace the RAM modules in that server with new ECC RAM modules four weeks ago. Since then the file system problems are gone, but I have had two occurrences of the following error message (one on 27.12.24 and one last night):

Code:
Jan 09 00:42:46 proxmoxt kernel: VERIFY3(sa.sa_magic == SA_MAGIC) failed (8192 == 3100762)
Jan 09 00:42:46 proxmoxt kernel: PANIC at zfs_quota.c:88:zpl_get_file_info()

Logging in to the server is then no longer possible.

The zpool has plenty of space left (only about 10% used in total).

Unfortunately I cannot find anything helpful in the logs.

Any advice on how to narrow down the problem?

BR,
Jens
 
Dear all,

we are still fighting this issue. At the moment we see a completely unresponsive Proxmox server every 2-3 weeks (typically around the time the backups to a Proxmox Backup Server start), reporting the following lines in journalctl:

Code:
Feb 08 23:58:43 proxmoxt kernel: VERIFY3(sa.sa_magic == SA_MAGIC) failed (8192 == 3100762)
Feb 08 23:58:43 proxmoxt kernel: PANIC at zfs_quota.c:88:zpl_get_file_info()
Feb 08 23:58:43 proxmoxt kernel: Showing stack for process 1864533
Feb 08 23:58:43 proxmoxt kernel: CPU: 2 PID: 1864533 Comm: proxmox-backup- Tainted: P          IO       6.8.12-7-pve #1
Feb 08 23:58:43 proxmoxt kernel: Hardware name: Dell Inc. Precision WorkStation T3500  /09KPNV, BIOS A17 05/28/2013
Feb 08 23:58:43 proxmoxt kernel: Call Trace:
Feb 08 23:58:43 proxmoxt kernel:  <TASK>
Feb 08 23:58:43 proxmoxt kernel:  dump_stack_lvl+0x76/0xa0
Feb 08 23:58:43 proxmoxt kernel:  dump_stack+0x10/0x20
Feb 08 23:58:43 proxmoxt kernel:  spl_dumpstack+0x29/0x40 [spl]
Feb 08 23:58:43 proxmoxt kernel:  spl_panic+0xfc/0x120 [spl]
Feb 08 23:58:43 proxmoxt kernel:  ? dnode_cons+0x2ab/0x2d0 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  zpl_get_file_info+0x23a/0x250 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  dmu_objset_userquota_get_ids+0x257/0x4c0 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  dnode_setdirty+0x38/0x110 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  dnode_allocate+0x16b/0x1f0 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  dmu_object_alloc_impl+0x36e/0x420 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  ? __kmalloc_node+0x1cb/0x430
Feb 08 23:58:43 proxmoxt kernel:  dmu_object_alloc_dnsize+0x1f/0x40 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  zfs_mknode+0x1de/0x1020 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  zfs_create+0x774/0xa20 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  zpl_create+0xca/0x1e0 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  path_openat+0xec9/0x1190
Feb 08 23:58:43 proxmoxt kernel:  do_filp_open+0xaf/0x170
Feb 08 23:58:43 proxmoxt kernel:  do_sys_openat2+0xb3/0xe0
Feb 08 23:58:43 proxmoxt kernel:  __x64_sys_openat+0x6c/0xa0
Feb 08 23:58:43 proxmoxt kernel:  x64_sys_call+0x17cd/0x2480
Feb 08 23:58:43 proxmoxt kernel:  do_syscall_64+0x81/0x170
Feb 08 23:58:43 proxmoxt kernel:  ? do_syscall_64+0x8d/0x170
Feb 08 23:58:43 proxmoxt kernel:  ? __mod_memcg_lruvec_state+0x87/0x140
Feb 08 23:58:43 proxmoxt kernel:  ? __mod_lruvec_state+0x36/0x50
Feb 08 23:58:43 proxmoxt kernel:  ? __lruvec_stat_mod_folio+0x70/0xc0
Feb 08 23:58:43 proxmoxt kernel:  ? xas_find+0x6e/0x1d0
Feb 08 23:58:43 proxmoxt kernel:  ? next_uptodate_folio+0x93/0x290
Feb 08 23:58:43 proxmoxt kernel:  ? filemap_map_pages+0x4b8/0x5b0
Feb 08 23:58:43 proxmoxt kernel:  ? __fput+0x15e/0x2e0
Feb 08 23:58:43 proxmoxt kernel:  ? do_fault+0x26a/0x4f0
Feb 08 23:58:43 proxmoxt kernel:  ? __handle_mm_fault+0x894/0xf70
Feb 08 23:58:43 proxmoxt kernel:  ? do_syscall_64+0x8d/0x170
Feb 08 23:58:43 proxmoxt kernel:  ? __count_memcg_events+0x6f/0xe0
Feb 08 23:58:43 proxmoxt kernel:  ? count_memcg_events.constprop.0+0x2a/0x50
Feb 08 23:58:43 proxmoxt kernel:  ? handle_mm_fault+0xad/0x380
Feb 08 23:58:43 proxmoxt kernel:  ? do_user_addr_fault+0x33e/0x660
Feb 08 23:58:43 proxmoxt kernel:  ? irqentry_exit_to_user_mode+0x7b/0x260
Feb 08 23:58:43 proxmoxt kernel:  ? irqentry_exit+0x43/0x50
Feb 08 23:58:43 proxmoxt kernel:  ? exc_page_fault+0x94/0x1b0
Feb 08 23:58:43 proxmoxt kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Feb 08 23:58:43 proxmoxt kernel: RIP: 0033:0x74975bc16000
Feb 08 23:58:43 proxmoxt kernel: Code: 48 89 44 24 20 75 93 44 89 54 24 0c e8 39 d8 f8 ff 44 8b 54 24 0c 89 da 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 89 44 2>
Feb 08 23:58:43 proxmoxt kernel: RSP: 002b:00007ffceb7ac290 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Feb 08 23:58:43 proxmoxt kernel: RAX: ffffffffffffffda RBX: 00000000000800c2 RCX: 000074975bc16000
Feb 08 23:58:43 proxmoxt kernel: RDX: 00000000000800c2 RSI: 0000616c6ffe6cc0 RDI: 00000000ffffff9c
Feb 08 23:58:43 proxmoxt kernel: RBP: 0000616c6ffe6cc0 R08: 0000000000000000 R09: 0000000000000001
Feb 08 23:58:43 proxmoxt kernel: R10: 0000000000000180 R11: 0000000000000293 R12: 8421084210842109
Feb 08 23:58:43 proxmoxt kernel: R13: 0000616c6ffe6cf0 R14: 000074975bcb8560 R15: 00000000000aecd0
Feb 08 23:58:43 proxmoxt kernel:  </TASK>

In the meantime we have set up a new server (different case, power supply, mainboard, CPU and RAM; the RAM was also tested with memtest86+ beforehand) and only took over the four HDDs with the PVE installation and the ZFS pool. The issue persists.

The SMART values of the four HDDs look fine. Nevertheless, I started to replace the first HDD after the event this weekend. Resilvering worked fine.

Any advice on how to proceed?

BR, Jens
 
If you previously had RAM issues and the pool was written during that time, it's entirely possible that the on-disk structures are corrupt in some places. The assert that fails checks the "magic value" of the xattr part of the file. It's an assert for a reason - it's not supposed to ever fail, unless something is corrupt, and then all bets are off.
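
For illustration, here is a minimal stand-alone sketch (not the actual OpenZFS source) of what such a magic-value check amounts to. SA_MAGIC in OpenZFS is 0x2F505A, i.e. 3100762, which is the right-hand number in the VERIFY3 output above; 8192 (0x2000) is the value that was actually found in the header. The struct and function names below are made up for the example:

Code:
/*
 * Illustrative sketch only: mimics the kind of sanity check the panic
 * message refers to. SA_MAGIC (0x2F505A = 3100762) matches the expected
 * value in the journal output; 8192 is what was actually read from disk.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SA_MAGIC 0x2F505A /* expected magic of a system-attribute header */

/* simplified, made-up stand-in for the on-disk SA header */
typedef struct sa_hdr {
	uint32_t sa_magic;
	uint16_t sa_layout_info;
} sa_hdr_t;

/* stand-in for VERIFY3(x == y): abort when the assertion fails */
static void verify_magic(const sa_hdr_t *sa)
{
	if (sa->sa_magic != SA_MAGIC) {
		fprintf(stderr,
		    "VERIFY3(sa.sa_magic == SA_MAGIC) failed (%u == %u)\n",
		    (unsigned)sa->sa_magic, (unsigned)SA_MAGIC);
		abort(); /* in the kernel this is a PANIC, hanging the box */
	}
}

int main(void)
{
	sa_hdr_t good    = { .sa_magic = SA_MAGIC };
	sa_hdr_t corrupt = { .sa_magic = 8192 }; /* value from the journal above */

	verify_magic(&good);    /* passes silently */
	verify_magic(&corrupt); /* prints the failed-assertion line and aborts */
	return 0;
}

Compiled and run, the second call prints the same failed-assertion line seen in the journal and aborts, which is the userspace analogue of the kernel PANIC.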
 
OK. There were RAM issues some months ago. I have already done a zpool scrub several times since then. Also, the panic does not happen with every backup (if it were a corruption, it should happen every time, right?).
What is your recommendation in this case?
 
Recover as much data as possible and start over with a fresh pool. In general, after you've run a system with faulty memory, you cannot really tell what might have been corrupted or broken as a result.
 
Just one last question/check: my understanding was that a zfs scrub (which was done several times) should find, or at least report, such corruptions ... Is my understanding wrong?
 
ZFS scrub will check the checksums of dnodes/blocks; I am not sure whether it will actually try to read the xattrs on a semantic level.
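
A toy example (plain C, not ZFS code) of why a scrub can come back clean even though a header field is nonsense: the block checksum is computed over whatever bytes were in memory at write time, so data that was already damaged before being written checksums perfectly. Only code that interprets the field semantically, like the quota accounting path in the stack trace above, notices the mismatch. The block layout and checksum below are invented for the illustration:

Code:
/*
 * Toy illustration of "checksum OK, content wrong": the checksum blesses
 * whatever bytes were present when the block was written, so a scrub that
 * only re-verifies checksums cannot detect a magic field that was already
 * corrupted in RAM before the write.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SA_MAGIC 0x2F505A

typedef struct block {
	uint32_t magic;      /* semantic content */
	uint8_t  payload[60];
	uint64_t checksum;   /* stored alongside the data */
} block_t;

/* toy checksum over everything except the checksum field itself */
static uint64_t toy_checksum(const block_t *b)
{
	uint64_t sum = 0;
	const uint8_t *p = (const uint8_t *)b;
	for (size_t i = 0; i < offsetof(block_t, checksum); i++)
		sum = sum * 31 + p[i];
	return sum;
}

int main(void)
{
	block_t b = { .magic = 8192 };   /* already corrupt in RAM */
	b.checksum = toy_checksum(&b);   /* checksum computed over the bad data */

	/* "scrub": re-read and compare checksums -> no error reported */
	printf("scrub: %s\n",
	    toy_checksum(&b) == b.checksum ? "OK (checksum matches)" : "CKSUM error");

	/* semantic check: only here does the corruption become visible */
	printf("magic: %s\n",
	    b.magic == SA_MAGIC ? "OK" : "mismatch -> VERIFY/PANIC path");
	return 0;
}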