PANIC at zfs_quota.c:88:zpl_get_file_info()

Hi,

I had some file system errors some time ago (see messages here). Since there was no clear evidence, I decided to replace the RAM modules in that server with new ECC RAM modules four weeks ago. Since then the file system problems are gone, but I have had two occurrences of the following error message (one on 27.12.24 and one last night):

Code:
Jan 09 00:42:46 proxmoxt kernel: VERIFY3(sa.sa_magic == SA_MAGIC) failed (8192 == 3100762)
Jan 09 00:42:46 proxmoxt kernel: PANIC at zfs_quota.c:88:zpl_get_file_info()

Logging in to the server is then no longer possible.

The zpool has plenty of space left (only about 10% used in total).

Unfortunately I cannot find anything helpful in the logs.

Any advice on how to narrow down the problem?

BR,
Jens
 
Dear all,

we are still fighting this issue. At the moment we see a completely unresponsive Proxmox server every 2-3 weeks (typically around the time the backups to a Proxmox Backup Server start), reporting the following lines in journalctl:

Code:
Feb 08 23:58:43 proxmoxt kernel: VERIFY3(sa.sa_magic == SA_MAGIC) failed (8192 == 3100762)
Feb 08 23:58:43 proxmoxt kernel: PANIC at zfs_quota.c:88:zpl_get_file_info()
Feb 08 23:58:43 proxmoxt kernel: Showing stack for process 1864533
Feb 08 23:58:43 proxmoxt kernel: CPU: 2 PID: 1864533 Comm: proxmox-backup- Tainted: P          IO       6.8.12-7-pve #1
Feb 08 23:58:43 proxmoxt kernel: Hardware name: Dell Inc. Precision WorkStation T3500  /09KPNV, BIOS A17 05/28/2013
Feb 08 23:58:43 proxmoxt kernel: Call Trace:
Feb 08 23:58:43 proxmoxt kernel:  <TASK>
Feb 08 23:58:43 proxmoxt kernel:  dump_stack_lvl+0x76/0xa0
Feb 08 23:58:43 proxmoxt kernel:  dump_stack+0x10/0x20
Feb 08 23:58:43 proxmoxt kernel:  spl_dumpstack+0x29/0x40 [spl]
Feb 08 23:58:43 proxmoxt kernel:  spl_panic+0xfc/0x120 [spl]
Feb 08 23:58:43 proxmoxt kernel:  ? dnode_cons+0x2ab/0x2d0 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  zpl_get_file_info+0x23a/0x250 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  dmu_objset_userquota_get_ids+0x257/0x4c0 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  dnode_setdirty+0x38/0x110 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  dnode_allocate+0x16b/0x1f0 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  dmu_object_alloc_impl+0x36e/0x420 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  ? __kmalloc_node+0x1cb/0x430
Feb 08 23:58:43 proxmoxt kernel:  dmu_object_alloc_dnsize+0x1f/0x40 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  zfs_mknode+0x1de/0x1020 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  zfs_create+0x774/0xa20 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  zpl_create+0xca/0x1e0 [zfs]
Feb 08 23:58:43 proxmoxt kernel:  path_openat+0xec9/0x1190
Feb 08 23:58:43 proxmoxt kernel:  do_filp_open+0xaf/0x170
Feb 08 23:58:43 proxmoxt kernel:  do_sys_openat2+0xb3/0xe0
Feb 08 23:58:43 proxmoxt kernel:  __x64_sys_openat+0x6c/0xa0
Feb 08 23:58:43 proxmoxt kernel:  x64_sys_call+0x17cd/0x2480
Feb 08 23:58:43 proxmoxt kernel:  do_syscall_64+0x81/0x170
Feb 08 23:58:43 proxmoxt kernel:  ? do_syscall_64+0x8d/0x170
Feb 08 23:58:43 proxmoxt kernel:  ? __mod_memcg_lruvec_state+0x87/0x140
Feb 08 23:58:43 proxmoxt kernel:  ? __mod_lruvec_state+0x36/0x50
Feb 08 23:58:43 proxmoxt kernel:  ? __lruvec_stat_mod_folio+0x70/0xc0
Feb 08 23:58:43 proxmoxt kernel:  ? xas_find+0x6e/0x1d0
Feb 08 23:58:43 proxmoxt kernel:  ? next_uptodate_folio+0x93/0x290
Feb 08 23:58:43 proxmoxt kernel:  ? filemap_map_pages+0x4b8/0x5b0
Feb 08 23:58:43 proxmoxt kernel:  ? __fput+0x15e/0x2e0
Feb 08 23:58:43 proxmoxt kernel:  ? do_fault+0x26a/0x4f0
Feb 08 23:58:43 proxmoxt kernel:  ? __handle_mm_fault+0x894/0xf70
Feb 08 23:58:43 proxmoxt kernel:  ? do_syscall_64+0x8d/0x170
Feb 08 23:58:43 proxmoxt kernel:  ? __count_memcg_events+0x6f/0xe0
Feb 08 23:58:43 proxmoxt kernel:  ? count_memcg_events.constprop.0+0x2a/0x50
Feb 08 23:58:43 proxmoxt kernel:  ? handle_mm_fault+0xad/0x380
Feb 08 23:58:43 proxmoxt kernel:  ? do_user_addr_fault+0x33e/0x660
Feb 08 23:58:43 proxmoxt kernel:  ? irqentry_exit_to_user_mode+0x7b/0x260
Feb 08 23:58:43 proxmoxt kernel:  ? irqentry_exit+0x43/0x50
Feb 08 23:58:43 proxmoxt kernel:  ? exc_page_fault+0x94/0x1b0
Feb 08 23:58:43 proxmoxt kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Feb 08 23:58:43 proxmoxt kernel: RIP: 0033:0x74975bc16000
Feb 08 23:58:43 proxmoxt kernel: Code: 48 89 44 24 20 75 93 44 89 54 24 0c e8 39 d8 f8 ff 44 8b 54 24 0c 89 da 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 89 44 2>
Feb 08 23:58:43 proxmoxt kernel: RSP: 002b:00007ffceb7ac290 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Feb 08 23:58:43 proxmoxt kernel: RAX: ffffffffffffffda RBX: 00000000000800c2 RCX: 000074975bc16000
Feb 08 23:58:43 proxmoxt kernel: RDX: 00000000000800c2 RSI: 0000616c6ffe6cc0 RDI: 00000000ffffff9c
Feb 08 23:58:43 proxmoxt kernel: RBP: 0000616c6ffe6cc0 R08: 0000000000000000 R09: 0000000000000001
Feb 08 23:58:43 proxmoxt kernel: R10: 0000000000000180 R11: 0000000000000293 R12: 8421084210842109
Feb 08 23:58:43 proxmoxt kernel: R13: 0000616c6ffe6cf0 R14: 000074975bcb8560 R15: 00000000000aecd0
Feb 08 23:58:43 proxmoxt kernel:  </TASK>

In the meantime we have set up a new server (different case, power supply, mainboard, CPU and RAM; the RAM was also tested with memtest86+ beforehand) and only took over the four HDDs with the PVE installation and the ZFS pool. The issue persists.

The SMART values of the four HDDs look fine. Nevertheless, I started to replace the first HDD after the event this weekend. Resilvering worked fine.

Any advice on how to proceed?

BR, Jens
 
If you previously had RAM issues and the pool was written during that time, it's entirely possible that the on-disk structures are corrupt in some places. The assert that fails checks the "magic value" of the xattr part of the file. It's an assert for a reason - it's not supposed to ever fail, unless something is corrupt, and then all bets are off.
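
For illustration, here is a minimal stand-alone sketch (not the actual OpenZFS source) of what such a magic-value check amounts to. SA_MAGIC in OpenZFS is 0x2F505A, i.e. 3100762, which is the right-hand number in the VERIFY3 output above; 8192 (0x2000) is the value that was actually found in the header. The struct and function names below are made up for the example:

Code:
/*
 * Illustrative sketch only: mimics the kind of sanity check the panic
 * message refers to. SA_MAGIC (0x2F505A = 3100762) matches the expected
 * value in the journal output; 8192 is what was actually read from disk.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SA_MAGIC 0x2F505A /* expected magic of a system-attribute header */

/* simplified, made-up stand-in for the on-disk SA header */
typedef struct sa_hdr {
	uint32_t sa_magic;
	uint16_t sa_layout_info;
} sa_hdr_t;

/* stand-in for VERIFY3(x == y): abort when the assertion fails */
static void verify_magic(const sa_hdr_t *sa)
{
	if (sa->sa_magic != SA_MAGIC) {
		fprintf(stderr,
		    "VERIFY3(sa.sa_magic == SA_MAGIC) failed (%u == %u)\n",
		    (unsigned)sa->sa_magic, (unsigned)SA_MAGIC);
		abort(); /* in the kernel this is a PANIC, hanging the box */
	}
}

int main(void)
{
	sa_hdr_t good    = { .sa_magic = SA_MAGIC };
	sa_hdr_t corrupt = { .sa_magic = 8192 }; /* value from the journal above */

	verify_magic(&good);    /* passes silently */
	verify_magic(&corrupt); /* prints the failed-assertion line and aborts */
	return 0;
}

Compiled and run, the second call prints the same failed-assertion line seen in the journal and aborts, which is the userspace analogue of the kernel PANIC.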
 
OK. There were RAM issues some months ago. I have already done a zpool scrub several times since then. Also, the panic does not happen with every backup (if it were a corruption, it should happen every time, right?).
What is your recommendation in this case?
 
Recover as much data as possible and start over with a fresh pool. In general, after you've run a system with faulty memory, you cannot really tell what might have been corrupted or broken as a result.
 
Just one last question/check: my understanding was that a zfs scrub (which was done several times) should find, or at least report, such corruptions ... Is my understanding wrong?
 
ZFS scrub will check the checksums of dnodes/blocks; I am not sure whether it will actually try to read the xattrs on a semantic level.
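
A toy example (plain C, not ZFS code) of why a scrub can come back clean even though a header field is nonsense: the block checksum is computed over whatever bytes were in memory at write time, so data that was already damaged before being written checksums perfectly. Only code that interprets the field semantically, like the quota accounting path in the stack trace above, notices the mismatch. The block layout and checksum below are invented for the illustration:

Code:
/*
 * Toy illustration of "checksum OK, content wrong": the checksum blesses
 * whatever bytes were present when the block was written, so a scrub that
 * only re-verifies checksums cannot detect a magic field that was already
 * corrupted in RAM before the write.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SA_MAGIC 0x2F505A

typedef struct block {
	uint32_t magic;      /* semantic content */
	uint8_t  payload[60];
	uint64_t checksum;   /* stored alongside the data */
} block_t;

/* toy checksum over everything except the checksum field itself */
static uint64_t toy_checksum(const block_t *b)
{
	uint64_t sum = 0;
	const uint8_t *p = (const uint8_t *)b;
	for (size_t i = 0; i < offsetof(block_t, checksum); i++)
		sum = sum * 31 + p[i];
	return sum;
}

int main(void)
{
	block_t b = { .magic = 8192 };   /* already corrupt in RAM */
	b.checksum = toy_checksum(&b);   /* checksum computed over the bad data */

	/* "scrub": re-read and compare checksums -> no error reported */
	printf("scrub: %s\n",
	    toy_checksum(&b) == b.checksum ? "OK (checksum matches)" : "CKSUM error");

	/* semantic check: only here does the corruption become visible */
	printf("magic: %s\n",
	    b.magic == SA_MAGIC ? "OK" : "mismatch -> VERIFY/PANIC path");
	return 0;
}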