PANIC: rpool: blkptr at ... DVA 0 has invalid OFFSET 18388167655883276288

niziak · May 10, 2021

Hello.
I reported this ZFS issue here: PANIC: rpool: blkptr at ... DVA 0 has invalid OFFSET 18388167655883276288 #12019

The IO delay on the node rises from minute to minute. After some hours the node stops responding completely. Services already in RAM (like ceph) keep running.
After a long time the cluster shows this node as unavailable, but it still responds to ping and accepts TCP connections on the ssh port (no login possible). It requires a manual reset, which brings it back to life only for a short while, until one of the VMs touches the problematic ZFS area.

I know that no software is perfect and that OpenZFS can raise a kernel PANIC. But why is the node not rebooted when the kernel cmdline contains panic=30?
Additionally, I think that in this case the local watchdog should detect the issue and reboot the node.

niziak · May 24, 2021

My findings:

There is no tool to repair ZFS. One is planned for some time in the future.
Scrub only validates checksums. In this case incorrect data was stored correctly on the VDEVs, so scrub cannot help.

Sometimes a read error appears during a zdb check:

Code:

db_blkptr_cb: Got error 52 reading <259, 75932, 0, 17> DVA[0]=<0:158039e9000:6000> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/6000P birth=62707L/62707P fill=1 cksum=516dd1ace1c:414cbfc202333b:af36411a2766c4f:7bc4d6777673687b -- skipping

Digging in the ZFS driver sources shows that error 52 is ECKSUM from SPL. Poor HW?
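
To illustrate the errno lookup: as far as I can tell, OpenZFS on Linux has no errno of its own for checksum failures and reuses EBADE as ECKSUM, and EBADE is 52 on common Linux architectures such as x86. A minimal sketch under that assumption (the helper name is_checksum_error is mine, not a ZFS function):

```c
#include <errno.h>

/* Assumption: mirrors how OpenZFS on Linux aliases ECKSUM to EBADE. */
#ifndef ECKSUM
#define ECKSUM EBADE
#endif

/* Hypothetical helper: returns 1 if err is the checksum error seen in zdb. */
int is_checksum_error(int err)
{
    return err == ECKSUM;
}
```

With this, is_checksum_error(52) is true on x86 Linux while EIO (5) is not, so the zdb message points to a checksum mismatch rather than a plain I/O error.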

And one important point: a PANIC from SPL is not a Linux panic! This is why the system gets stuck. Some debug code is left in SPL, and I think this code should be removed by the Proxmox team for production use:

C:

vcmn_err(int ce, const char *fmt, va_list ap)
{
    ....
    case CE_PANIC:
        printk(KERN_EMERG "PANIC: %s\n", msg);
        spl_dumpstack();

        /* Halt the thread to facilitate further debugging */
        set_current_state(TASK_UNINTERRUPTIBLE);
        while (1)
            schedule();
    }
}

There is an spl module parameter, spl_panic_halt, which should enable a real kernel panic, but the check is implemented only in the function spl_panic and not in the function vcmn_err:

C:

int spl_panic(const char *file, const char *func, int line, const char *fmt, ...)
{
    ...
    printk(KERN_EMERG "%s", msg);
    printk(KERN_EMERG "PANIC at %s:%d:%s()\n", newfile, line, func);
    if (spl_panic_halt)
        panic("%s", msg);
    spl_dumpstack();
    /* Halt the thread to facilitate further debugging */
    set_current_state(TASK_UNINTERRUPTIBLE);
    while (1)
        schedule();
    /* Unreachable */
    return (1);
}

A patch is ready: Consistent SPL module param 'spl_panic_halt' handling. #12120
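
The idea behind the patch can be sketched in user space. This is only an illustration, not the actual patch: spl_panic_halt here is a plain variable standing in for the module parameter, and choose_panic_action and handle_ce_panic are hypothetical names. The point is that every panic path should consult the same switch before deciding whether to halt the thread or call the kernel's panic():

```c
#include <stdio.h>

/* Stand-in for the spl module parameter of the same name (assumption). */
unsigned int spl_panic_halt = 0;

enum panic_action {
    ACTION_HALT_THREAD,   /* behavior shown above: park the thread forever */
    ACTION_KERNEL_PANIC   /* real panic(), so panic=30 can reboot the node */
};

/* Single decision point shared by every SPL panic path. */
enum panic_action choose_panic_action(unsigned int panic_halt)
{
    return panic_halt ? ACTION_KERNEL_PANIC : ACTION_HALT_THREAD;
}

/* Hypothetical model of the CE_PANIC branch; it only reports the decision. */
enum panic_action handle_ce_panic(const char *msg)
{
    enum panic_action action = choose_panic_action(spl_panic_halt);

    fprintf(stderr, "PANIC: %s\n", msg);
    if (action == ACTION_KERNEL_PANIC)
        fprintf(stderr, "would call panic()\n");
    else
        fprintf(stderr, "would halt the calling thread\n");
    return action;
}
```

With spl_panic_halt left at 0 the model halts the thread, which matches the hang described above. With it set to 1 the node would really panic, and the panic=30 timeout could then trigger a reboot.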
