PANIC: * invalid TYPE 140 (My confidence in ZFS on Linux for business use is strongly strained)

May 13, 2020
Hi all,

what happened?

A PVE server commissioned in 2018 stopped responding after almost two years of trouble-free continuous operation. Neither the PVE GUI nor any other service was reachable, although the host still answered ping.

Logging in at the physical console still worked, but the subsequent shutdown got stuck while stopping the VMs…

2020-05-10_0916_600px.jpg

A reset was the only remaining option.

The subsequent boot process hangs when the root system imports rpool.

2020-05-11_1203_600px.jpg

Shortly after that, a batch of kernel messages appears every two minutes:

2020-05-11_1204_600px.jpg

After an initial hardware check, a hardware problem can be ruled out.

rpool is on a mirror of 2x SSD Samsung 850 Pro 1TB, another pool on a raidz2 with 4x 4TB HDD, no ZIL, no L2ARC.

CPU: AMD FX-8350
RAM: 32GB DDR3 ECC
Proxmox, probably PVE 5.2
Kernel: 4.15.18-21-pve

The SSDs are okay according to smartctl.

The big problem: rpool can no longer be imported in any way that I know of, even though a zpool import shows an intact pool:

Code:
# zpool import
   pool: rpool
     id: 9373167444002024865
  state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
    the '-f' flag.
   see: http://zfsonlinux.org/msg/ZFS-8000-EY
config:

    rpool       ONLINE
      mirror-0  ONLINE
        sda2    ONLINE
        sdc2    ONLINE


A zpool import -f rpool does not end with a normal error message, but instead triggers a kernel panic:

May 11 13:12:38 pve-resc kernel: [ 5314.207156] PANIC: blkptr at 000000004ab2be1f has invalid TYPE 140
May 11 13:12:38 pve-resc kernel: [ 5314.207161] Showing stack for process 24099
May 11 13:12:38 pve-resc kernel: [ 5314.207168] CPU: 2 PID: 24099 Comm: txg_sync Tainted: P O 5.0.0-32-generic #34~18.04.2-Ubuntu
May 11 13:12:38 pve-resc kernel: [ 5314.207169] Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014
May 11 13:12:38 pve-resc kernel: [ 5314.207171] Call Trace:
May 11 13:12:38 pve-resc kernel: [ 5314.207180] dump_stack+0x63/0x85
May 11 13:12:38 pve-resc kernel: [ 5314.207194] spl_dumpstack+0x42/0x50 [spl]
May 11 13:12:38 pve-resc kernel: [ 5314.207202] vcmn_err+0xc3/0x100 [spl]
May 11 13:12:38 pve-resc kernel: [ 5314.207208] ? _cond_resched+0x19/0x40
May 11 13:12:38 pve-resc kernel: [ 5314.207212] ? __kmalloc+0x62/0x210
May 11 13:12:38 pve-resc kernel: [ 5314.207215] ? sg_kmalloc+0x19/0x30
May 11 13:12:38 pve-resc kernel: [ 5314.207217] ? sg_init_table+0x15/0x40
May 11 13:12:38 pve-resc kernel: [ 5314.207219] ? __sg_alloc_table+0x9b/0x160
May 11 13:12:38 pve-resc kernel: [ 5314.207220] ? sg_zero_buffer+0xc0/0xc0
May 11 13:12:38 pve-resc kernel: [ 5314.207307] zfs_panic_recover+0x69/0x90 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207346] ? abd_alloc+0x2cd/0x480 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207386] ? arc_read+0xa60/0xa60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207440] zfs_blkptr_verify+0xfc/0x3a0 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207442] ? _cond_resched+0x19/0x40
May 11 13:12:38 pve-resc kernel: [ 5314.207497] zio_read+0x34/0xa0 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207537] ? arc_read+0xa60/0xa60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207577] arc_read+0x5ff/0xa60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207617] ? arc_buf_destroy+0x140/0x140 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207668] dsl_scan_prefetch.isra.8+0xb7/0xd0 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207718] dsl_scan_visitbp+0x3c6/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207769] dsl_scan_visitbp+0x487/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207820] dsl_scan_visitbp+0x7c5/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207871] dsl_scan_visitbp+0x487/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207922] dsl_scan_visitbp+0x487/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.207973] dsl_scan_visitbp+0x487/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208024] dsl_scan_visitbp+0x487/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208075] dsl_scan_visitbp+0x487/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208127] dsl_scan_visitbp+0x97b/0xd60 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208178] dsl_scan_visitds+0x10c/0x4e0 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208231] dsl_scan_sync+0x2ef/0xb90 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208288] ? zio_destroy+0xbc/0xc0 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208342] ? zio_wait+0x147/0x1b0 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208397] spa_sync+0x49e/0xd30 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208455] txg_sync_thread+0x2cd/0x4a0 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208458] ? __switch_to_asm+0x35/0x70
May 11 13:12:38 pve-resc kernel: [ 5314.208514] ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
May 11 13:12:38 pve-resc kernel: [ 5314.208521] thread_generic_wrapper+0x74/0x90 [spl]
May 11 13:12:38 pve-resc kernel: [ 5314.208525] kthread+0x121/0x140
May 11 13:12:38 pve-resc kernel: [ 5314.208532] ? __thread_exit+0x20/0x20 [spl]
May 11 13:12:38 pve-resc kernel: [ 5314.208534] ? kthread_park+0xb0/0xb0
May 11 13:12:38 pve-resc kernel: [ 5314.208537] ret_from_fork+0x22/0x40
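
For anyone who wants to compare a trace like this with their own logs, the following is a minimal sketch of a filter that keeps only the PANIC line and the reliable ZFS/SPL frames. Frames prefixed with "?" are addresses the kernel found on the stack but could not confirm as part of the call chain, so they are dropped. The function name `panic_summary` is made up for illustration, and the pattern assumes the syslog layout shown above (`kernel: [ timestamp] message`).

```shell
# Hypothetical helper: condense a kernel log (read from stdin) to the
# PANIC message and the confirmed [zfs]/[spl] stack frames.
panic_summary() {
    # keep the PANIC line and every frame that belongs to the zfs or spl module
    grep -E 'PANIC:|\[(zfs|spl)\]' |
        # drop unreliable frames, which the kernel marks with "?"
        grep -vF '] ? ' |
        # strip the syslog prefix and the kernel timestamp
        sed -E 's/^.*kernel: \[ *[0-9.]+\] //'
}
```

Usage would be something like `panic_summary < /var/log/kern.log`, assuming the kernel messages are logged to that file on the rescue system.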


The following command behaves no differently:
zpool import -f -F -m -N -o cachefile=none rpool

This was tested on different hardware under different kernel versions, most recently under the current PVE 6.2-1.

The following message is conspicuous:

PANIC: blkptr at 000000004ab2be1f has invalid TYPE 140

I have found absolutely no references to this message on the net in connection with ZFS.

Have we come across a new bug in the ZFS universe here?

Does anyone have a helpful idea for getting the server up and running again?

 
> CPU: 2 PID: 24099 Comm: txg_sync Tainted: P O 5.0.0-32-generic #34~18.04.2-Ubuntu

Ubuntu Kernel?

Try to import the pool with the Proxmox VE ISO (debug mode). I am not sure whether Ubuntu 18.04 works for this.

See also https://pve.proxmox.com/wiki/Debugging_Installation

Besides that, you have missed a lot of important updates recently, and you are running quite old desktop hardware with cheap consumer SSDs. I only mention this because you talk about "business use".
 
FYI, your CPU doesn't support buffered ECC RAM.

That definitely looks like a hardware problem.

Did you try booting with just a single disk? Try each of the two in turn.

Make sure the cables are OK.

Updating Proxmox could also help; a much newer ZFS version is available.
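
One more thing that may be worth trying from the rescue system is a read-only import. The trace above shows the panic in the txg_sync thread (via dsl_scan_sync), and a read-only import might avoid that code path so the data can at least be copied off. This is only a sketch and has not been verified against this pool. The function name `try_readonly_import` and the altroot `/mnt/rescue` are made up for illustration; `-f`, `-N`, `-o readonly=on` and `-R` are regular `zpool import` options.

```shell
# Hypothetical helper: build a read-only import command for a pool.
# By default it only prints the command; set RUN=1 to actually execute it.
try_readonly_import() {
    pool="$1"                      # pool name, e.g. rpool
    altroot="${2:-/mnt/rescue}"    # assumed mount prefix, adjust as needed
    # -f: force import, -N: do not mount datasets,
    # -o readonly=on: import without writing to the pool, -R: alternate root
    cmd="zpool import -f -N -o readonly=on -R $altroot $pool"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    else
        echo "$cmd"
    fi
}
```

Running `try_readonly_import rpool` prints the command for review first. If the import succeeds, individual datasets can then be mounted by hand and backed up before anything else is attempted.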