PANIC: rpool: blkptr at ... DVA 0 has invalid OFFSET 18388167655883276288

niziak

Member
Apr 18, 2020
Hello.
I reported a ZFS issue here: PANIC: rpool: blkptr at ... DVA 0 has invalid OFFSET 18388167655883276288 #12019

The IO delay on the node rises from minute to minute. After some hours the node stops responding completely. Services already in RAM (like Ceph) keep running.
After a long time the cluster shows this node as unavailable, but it still responds to ping and accepts TCP connections on the SSH port (no login is possible). It requires a manual reset, which brings it back to life only for a short while, until one of the VMs touches the problematic ZFS area.

I know that no software is perfect and that OpenZFS can raise a kernel PANIC. But why is the node not rebooted when the kernel cmdline contains panic=30?
Additionally, I think that in this case the local watchdog should detect the issue and reboot the node.
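
To rule out a misconfigured timeout, the value set by panic=30 can be read back from /proc/sys/kernel/panic. Below is a minimal sketch in C. The helper names read_int_file and report_panic_timeout are made up for illustration and are not part of any existing tool.

```c
#include <stdio.h>

/*
 * Read a single integer from a sysctl-style file.
 * Returns 0 on success (value stored in *out), -1 on any failure.
 * Hypothetical helper, named for this example only.
 */
static int read_int_file(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * Print the panic timeout the running kernel is using.
 * A positive value is the number of seconds before reboot,
 * 0 means no automatic reboot, a negative value means reboot at once.
 */
static int report_panic_timeout(void)
{
    long timeout;

    if (read_int_file("/proc/sys/kernel/panic", &timeout) != 0) {
        fprintf(stderr, "cannot read /proc/sys/kernel/panic\n");
        return -1;
    }

    printf("kernel.panic = %ld (%s)\n", timeout,
           timeout > 0 ? "reboot after that many seconds" :
           timeout == 0 ? "no automatic reboot" : "immediate reboot");
    return 0;
}
```

Calling report_panic_timeout() from a small main() on the affected node should show 30 if the cmdline option took effect. A correct value here only says what would happen on a real kernel panic; it says nothing about whether one was actually triggered.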
 
My findings:
  • There is no tool to repair ZFS. One is planned for some point in the future.
  • Scrub only validates checksums. In this case the incorrect data was itself stored correctly on the VDEVs, so scrub cannot help.
  • Sometimes a read error appears during a zdb check:
    Code:
    db_blkptr_cb: Got error 52 reading <259, 75932, 0, 17> DVA[0]=<0:158039e9000:6000> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/6000P birth=62707L/62707P fill=1 cksum=516dd1ace1c:414cbfc202333b:af36411a2766c4f:7bc4d6777673687b -- skipping
    Digging through the ZFS sources shows that error 52 is ECKSUM from SPL. Poor hardware?
  • And an important one: a PANIC from SPL is not a Linux kernel panic! This is why the system gets stuck. Some debug code is left in SPL, and I think the Proxmox team should remove it for production use:
C:
vcmn_err(int ce, const char *fmt, va_list ap)
{
    ...
    case CE_PANIC:
        printk(KERN_EMERG "PANIC: %s\n", msg);
        spl_dumpstack();

        /* Halt the thread to facilitate further debugging */
        set_current_state(TASK_UNINTERRUPTIBLE);
        while (1)
            schedule();
    }
There is an SPL module parameter, spl_panic_halt, which should trigger a real kernel panic, but the check was added only to the function spl_panic and not to vcmn_err:
C:
int spl_panic(const char *file, const char *func, int line, const char *fmt, ...)
{
    ...
    printk(KERN_EMERG "%s", msg);
    printk(KERN_EMERG "PANIC at %s:%d:%s()\n", newfile, line, func);
    if (spl_panic_halt)
        panic("%s", msg);
    spl_dumpstack();
    /* Halt the thread to facilitate further debugging */
    set_current_state(TASK_UNINTERRUPTIBLE);
    while (1)
        schedule();
    /* Unreachable */
    return (1);
}

Ready patch: Consistent SPL module param 'spl_panic_halt' handling. #12120
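
The idea behind the fix can be modelled in plain userspace C: the CE_PANIC path consults the same spl_panic_halt flag that spl_panic already checks. This is only a sketch of the control flow, not the actual patch. The enum panic_action and the function ce_panic_action are hypothetical names invented here, and the real kernel calls are only mentioned in comments.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace stand-in for the spl_panic_halt module parameter. */
static int spl_panic_halt = 0;

/* Hypothetical result type: what the CE_PANIC handler would do next. */
enum panic_action {
    ACTION_HALT_THREAD,   /* kernel code: set_current_state() + schedule() loop */
    ACTION_KERNEL_PANIC   /* kernel code: panic("%s", msg) */
};

/*
 * Model of the CE_PANIC branch with the spl_panic_halt check applied.
 * Hypothetical helper for illustration; it only reports the decision
 * instead of halting or panicking.
 */
static enum panic_action ce_panic_action(const char *msg)
{
    assert(msg != NULL);
    fprintf(stderr, "PANIC: %s\n", msg);

    if (spl_panic_halt)
        return ACTION_KERNEL_PANIC;

    return ACTION_HALT_THREAD;
}
```

With spl_panic_halt left at 0 the model returns ACTION_HALT_THREAD, which corresponds to the stuck thread described above. With the flag set to 1 it returns ACTION_KERNEL_PANIC, the case in which panic=30 could finally lead to a reboot.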
 